Problem statement and summary of most important results: In this project, we have developed a
new Bayesian framework for visual perception. The framework makes use of bottom-up computation
heuristics (including salience maps) and top-down knowledge (where high-level hypotheses guide
low-level visual processing). As this yields complex computations and a large search space of hypotheses
for interpretation of the visual data, we developed a number of new techniques to make the system
computationally tractable. In particular, we use probabilistic techniques reminiscent of recent
approaches to probabilistic robotics (including MCMC, DDMCMC, and particle filters). This project is
described in detail in the following pages.

In  addition,  we  have  completed  experiments  to  elucidate  the  relationship  between  cognition  and  visual 
processing.  This  work  provides  important  guidelines  for  further  development  of  our  computational 
vision  frameworks.  The  key  question  addressed  here  is  how  humans  may  re-use  brain  regions 
evolutionarily  associated  with  some  form  of  processing  (e.g.,  vision)  to  serve  other  forms  of  processing 
(e.g.,  algebra,  mental  memorization  and  sorting  of  strings  of  numbers)  which  are  too  recent  on  an 
evolutionary time scale to have dedicated brain areas. The following pages also present the results from
this project in detail.

Applications of our new Bayesian work on attention have been rich and diverse. We have applied
our model to search and classification of targets (Li & Itti, IEEE Trans Image Proc 2011), to the
better prediction of human eye movements during viewing of complex dynamic scenes by using
top-down templates on attention (Li et al., Image & Vision Computing 2011), to combined bottom-up and
top-down prediction of gaze and actions in interactive tasks such as driving (Borji & Itti, CVPR 2012;
Borji et al., CVPR 2012), to top-down-guided visual search (Elazary & Itti, Vision Res 2010; Baluch & Itti,
PLoS ONE 2012), to robotics (Siagian et al., J Field Robotics 2011), and to attention-based visual
prostheses (Parikh et al., J Neural Engineering 2010).

Finally, our work on this project has allowed us to write two review papers, one of which appeared in a
high-impact journal (Baluch & Itti, Trends in Neurosciences 2011, impact factor 13.3; Borji & Itti, IEEE
PAMI 2012).
New Bayesian framework for perception

Student  supported:  Lior  Elazary 


Abstract 

A  biologically-inspired  framework  for  perception  is  proposed  and  implemented,  which  helps 
guide  the  systematic  development  of  machine  vision  algorithms  and  methods.  The  core  is  a  hierarchical 
Bayesian  inference  system.  Hypotheses  about  objects  in  a  visual  scene  are  generated  "bottom-up"  from 
sensor  data.  These  hypotheses  are  refined  and  validated  "top-down"  when  complex  objects, 
hypothesized  at  higher  levels,  impose  new  feature  and  location  priors  on  the  component  parts  of  these 
objects  at  lower  levels.  To  efficiently  implement  the  framework,  an  important  new  contribution  is  to 
systematically  utilize  the  concept  of  bottom-up  saliency  maps  to  narrow  down  the  space  of  hypotheses. 
In  addition,  we  let  the  system  hallucinate  top-down  (manufacture  its  own  data)  at  low  levels  given 
high-level  hypotheses,  to  overcome  missing  data,  ambiguities  and  noise.  The  implemented  system  is 
tested  against  images  of  real  scenes  containing  simple  2D  objects  against  various  backgrounds.  The 
system  correctly  recognizes  the  objects  in  98.71%  of  621  video  frames,  as  compared  to  SIFT  which 
achieves  38.00%. 


Introduction 


In the 1970s and 1980s, many researchers believed that we would soon be able to build
machines that can think and act for themselves. Although a number of impressive algorithms have
emerged to efficiently solve many artificial intelligence problems, these machines never materialized,
due to what we think is a failure in perception. For example, a computer can beat any known human in
chess if the computer knows exactly where each of the pieces is on the board. However, given a
camera, inferring the position of the chess pieces remains very difficult. For that reason, we believe that
if the perception problem were solved, many of the machines promised in the past would be able to
materialize. Therefore, we propose a framework for perception, which can help guide the development
of algorithms and methods in a more systematic manner. This will enable researchers to
concentrate on particular problems in perception, whether the structure of the system or
efficient computation when matching against a large amount of data or model representations.
Additionally, we present a specific implementation for visual perception in the realm of object
recognition.


Using a visual sensor to recognize objects has long been a problem in computer vision. Many
algorithms have been developed, but they all seem to fall short of good performance. This is often
because extracting low-level features from a camera is difficult and prone to errors. For
example, edge detectors often have problems with noisy edges, occlusion, missing edges, and
edges that do not belong to the object, such as specular edges (Figure 1).




Figure 1: Example of an image taken with a camera, and the results of a Sobel edge detector.


Humans seem to have developed a solution to this problem, which has remained a mystery so
far. By examining visual illusions, we can attempt to extract some of the processes that are
occurring in the brain. Looking at Figure 2, we can see that the squares marked A and B perceptually
seem to have different shades of gray, whereas in reality they have the same shade of gray (pixel
gray-level value; you can check this by occluding the rest of the display with a piece of paper except for a
small area around A and B). In this example, it seems that the brain is using high-level information
about the scene and shadows to “fix” the low-level information about the values of the squares.


Figure  2:  Checker-shadow  illusion  by  Adelson. 
Another visual illusion, illustrating “illusory contours”, can be seen in Figure 3. Many people report a
white triangle on top of another triangle, even though there are no contrast changes on the white
triangle. In this image, the brain has chosen to hallucinate the edges, even though they do not exist.




Figure  3:  Kanizsa  triangle  (1955). 

These illusions illustrate that contrast plays an important role in perception. Figure 4 shows
visual stimuli in which ambiguity occurs due to missing or noisy local features. The images show how
context (or prior information) can help resolve these ambiguities. The top left figure shows an example
of a highly localized use of context, where the center squares will be classified differently despite the
fact that they have the same contrast. This is a well-known visual phenomenon, which is thought to help
in determining the true intensity of various patches. This phenomenon results from the interaction of
neurons in the eye between the centers and surrounding regions. As a result, the different surrounding
context changes the classification of the center. Another example of context use is depicted in the top
right figure, where the circled image blobs are exactly the same (except for orientation) but get
classified differently based on their placement in the scene. Lastly, the bottom center figure shows an
ambiguity between the letters H and A, where only context can help determine the true identity of these
features.


In this work we use these illustrations to help “fix” the low-level features and provide more
robust object recognition. In particular, we use the concept of hallucinations based on priors to fill in
the missing data, which results in better recognition. We state that vision is nothing but controlled
hallucination, since most of the time we use the system to correct and fix low-level features. To
achieve this we use bottom-up features as proposals and not as the data, and attempt not to make any
major decision during bottom-up feature extraction. We then use top-down knowledge to provide the
decisions and impose priors on these features.
Figure 4: Example visual stimuli in which ambiguity occurs due to missing or noisy local features. Top
left,  the  center  squares  will  be  classified  differently  despite  the  fact  that  they  have  the  same  contrast. 
Top  right,  the  circled  image  blobs  are  exactly  the  same  (except  for  orientation),  but  get  classified 
differently  based  on  their  placement  in  the  scene  (obtained  from  (Torralba  2003)).  Bottom  center, 
ambiguity  between  the  letters  H  and  A. 
Problem Formulation


Figure  5:  Example  scene 

Figure 5 depicts a typical scene, where a typical percept of the scene could be the existence and
location of cars. Previous approaches to this problem have been to build a model of a car and
use a sliding window to search every location for the car. However, this might not be the
most efficient manner to search through many large images. Mapping this approach to the MCMC
framework, we can see how this can be improved. In this problem our goal is to determine p(W), where
W represents only two aspects of the world: the existence of a car and its position. In other words, we
wish to model p(Car = True/False, Position = x, y). Note that the position is in visual coordinates, but
we may wish to find the position in 3D. Using Bayesian mathematics we can formulate the probability
of the world W given an image I in terms of the likelihood model p(I|W) and the prior information
p(W).


$$W \sim p(W \mid I) \propto p(I \mid W)\, p(W)$$

However, p(W|I) might have an infinite solution space with complex structure, which would
make it very difficult to solve in closed form. In the MCMC paradigm the system would generate a
hypothesis for a particular state of the world (hypothesize that there is a car at x=5, y=6) and test it with
a model of a car using the likelihood model p(I|W). If the posterior probability is modeled as a state
machine, where the probability of being in the current state depends only on the previous state, then
this is known as a Markov chain. In particular:


$$p(W_t \mid I_t) \propto p(W_{t-1} \mid I_{t-1})\, p(W_t \mid W_{t-1})$$
Particle filtering, histograms, or other distribution models can then be used to estimate the
posterior density function. This is done by drawing samples from the distribution at strategic locations
(i.e., the proposal distribution Q(x)). The proposal distribution can be determined from previous
knowledge embedded in the system. One can therefore see that generating the hypotheses with a sliding
window approach amounts to searching every possible solution in a discrete space, while MCMC
only samples the search space in specific places. Unfortunately, MCMC can become very
inefficient if we want to consider multiple objects, their full pose (x, y, z, scale, orientation) as well as
other attributes like color and texture. As a result, a better sampling technique will need to be utilized.


Using  the  Metropolis-Hastings  (MH)  algorithm,  we  can  build  better  proposal  distributions 
which  can  be  used  for  sampling  based  on  prior  information.  The  algorithm  uses  a  proposal  density 
Q(W';  Wt)  which  depends  on  the  current  state  Wt  to  generate  a  new  proposal  W' .  The  proposal  is  then 
accepted  if: 


$$U(0,1) < \frac{p(W')\, Q(W_t; W')}{p(W_t)\, Q(W'; W_t)}$$

where U(0,1) is a uniform distribution between 0 and 1, p(W) is the probability of our model (that is,
was there a car there), and Q(W'; Wt) is the proposal distribution. For example, if we know that
cars are usually on the ground, a proposal distribution can be W = N(1/4 image height, 1/2 image
height), where N is the normal distribution with a mean and variance. This would then choose locations
which are closer to the ground and test them for the existence of cars. Note that if the car appeared in
the sky, the system would take longer to find the solution.
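
As an illustration of the acceptance rule, the following minimal Python sketch runs Metropolis-Hastings over a single vertical coordinate. It is not the implementation used in this project. The image height of 100 pixels, the coordinate convention (height measured up from the bottom of the image), and the function `car_score`, a stand-in for p(W), are all assumptions made for the example. The proposal is the Gaussian ground prior from the text and does not depend on the current state, so Q(a; b) reduces to the proposal density evaluated at a.

```python
import math
import random

IMAGE_HEIGHT = 100.0                      # hypothetical image height (pixels)
PRIOR_MEAN = IMAGE_HEIGHT / 4             # proposal mean, measured up from the image bottom
PRIOR_STD = math.sqrt(IMAGE_HEIGHT / 2)   # proposal variance of 1/2 image height


def car_score(y):
    """Hypothetical unnormalized p(W): response of a car model peaked at y = 20."""
    return math.exp(-0.5 * ((y - 20.0) / 5.0) ** 2)


def proposal_pdf(y):
    """Density of the Gaussian ground prior used as the proposal Q."""
    z = (y - PRIOR_MEAN) / PRIOR_STD
    return math.exp(-0.5 * z * z) / (PRIOR_STD * math.sqrt(2.0 * math.pi))


def metropolis_hastings(n_steps, seed=0):
    """Return the chain of accepted states W_t after n_steps proposals."""
    rng = random.Random(seed)
    w_t = PRIOR_MEAN                      # start the chain at the prior mean
    samples = []
    for _ in range(n_steps):
        w_new = rng.gauss(PRIOR_MEAN, PRIOR_STD)   # proposal ignores the current state
        # Ratio p(W') Q(Wt; W') / (p(Wt) Q(W'; Wt)), with Q(a; b) = proposal_pdf(a).
        ratio = (car_score(w_new) * proposal_pdf(w_t)) / (car_score(w_t) * proposal_pdf(w_new))
        if rng.random() < ratio:          # accept when U(0,1) falls below the ratio
            w_t = w_new
        samples.append(w_t)
    return samples
```

Under these assumptions the chain concentrates around the peak of `car_score`. A car far from the ground prior would be proposed rarely, which mirrors the slow convergence noted above.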

Although the above method of using prior information would yield faster convergence, it can still be
improved further by “listening” to the data in the image. This can be achieved by combining saliency
maps with the priors to generate better hypotheses. A simpler version of this method was proposed by Zhu
et al. (2002) and is known as data-driven Markov chain Monte Carlo (DDMCMC). As a result, not only
will locations around ground level be searched, but the search will also concentrate on regions
which are more salient. Equation 1.4 shows the modification to the proposal distribution to include
information from the sensor I.


$$Q(W_t; W' \mid I) \approx \frac{p(W' \mid I)}{p(W_t \mid I)}$$


The DDMCMC method has been successfully used in the past for image segmentation and scene
understanding (Zhu et al., 2003) and can deal with noise and ambiguity. However, DDMCMC does not
deal  well  with  missing  data  and  is  still  very  computationally  intensive  for  large  hierarchical 
implementations  of  the  framework.  Fortunately,  research  in  robotic  navigation  has  been  able  to  advance 
the  use  of  particle  filtering  and  geometrical  constraints  to  combat  some  of  these  problems. 
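
One simple way to picture a data-driven proposal is to multiply a bottom-up saliency map by the top-down prior and renormalize. The sketch below does this over image rows. It is only an illustration under that assumption: the multiplicative combination, the function names, and the toy values are hypothetical and are not taken from Zhu et al. or from the implemented system.

```python
import random


def data_driven_proposal(saliency, prior):
    """Combine a saliency map and a prior (one value per row) into a
    proposal distribution that sums to 1."""
    combined = [s * p for s, p in zip(saliency, prior)]
    total = sum(combined)
    if total == 0:
        # No row is both salient and plausible: fall back to a uniform proposal.
        return [1.0 / len(combined)] * len(combined)
    return [c / total for c in combined]


def sample_rows(proposal, n, seed=0):
    """Draw n candidate rows to test, in proportion to the proposal."""
    rng = random.Random(seed)
    return rng.choices(range(len(proposal)), weights=proposal, k=n)
```

For example, with `saliency = [0.0, 0.1, 0.9, 0.2]` and `prior = [0.1, 0.2, 0.3, 0.4]`, row 2 receives most of the proposal mass and row 0 is never tested.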
Visual Perception Framework


The framework is built from multiple hierarchical Bayesian inference modules that send and
receive information to/from each other. Each module has two types of inputs and outputs: Data, which
is passed from lower levels in the hierarchy up, and Priors, which are passed from higher levels down
(Figure 6). Note that the Data/Prior can consist of multiple heterogeneous streams. For example, a
module responsible for finding square contours in an image will accept corners and edges as Data. It
would then compute the probability distribution of squares existing in particular locations, compute the
prior probability that the corners/edges should exist in particular locations, and bias the underlying
edge and corner modules to correct or hallucinate any missing or noisy edges or corners.


Figure 6: The basic hierarchy setup of modules (top), where solid arrows represent feed-forward
Data connections and dashed arrows represent feedback Prior connections. An example triangle object,
where corners and edges are the features Fc and Fe and the pose of the triangle Ot corresponds to the
center of the triangle, can be seen in the particle P[m] (further described in the main text).
Each module is responsible for a particular belief within the system p(B|D), where B is the
belief  in  some  perception  and  D  is  the  evidence  for  that  perception.  For  example,  the  edges  module 
would  maintain  the  belief  of  edges  existing  in  the  world  given  the  values  from  edge  detectors,  such  as 
Gabor  filters.  This  belief  is  computed  using  Bayesian  inference  with  the  ability  to  hallucinate  data  in  a 
controlled  fashion.  Here  we  define  hallucination  as  the  ability  to  manufacture  data  that  is  not  there. 

This differs from the conventional use of priors, since we let the hallucinations offset the Data rather than
scale it, which results in hypotheses having positive values even if the underlying data has zero
values. This is important, since often a particular module could maintain that there is no evidence (e.g.,
for an edge in a particular location) but a higher-level module would insist on these edges being there.
For an example of this type of behavior in humans see the famous illusory contours (Kanizsa 1955).
However, if the system were allowed to hallucinate without control, it could always manufacture Data to
satisfy its beliefs. To prevent this we use the concept of surprise (Itti and Baldi 2006), which is the KL
divergence between the prior and the posterior, to break away from “false” hallucinations.

In what follows we focus on a particular module that is characterized by a set of features F with a pose
O, and which interacts with both lower-level modules (subscript L) and higher-level modules (subscript
H). We thus define a particular belief in perception F = {O, FL} in the module of interest to consist of
a pose O and a subset of features FL coming from lower-level modules (e.g., O might be the pose of a
triangle, FL would be its three corners, and F would be used by higher modules as evidence for a
particular triangle in a particular pose; Figure 6). That is, each module models a centralized relationship
O for a set of lower-level features FL and maintains it as F. Each module then computes the posterior
p(F | FL). In visual perception the pose O can consist of 2D or 3D pose parameters, while the features F
can be edges, corners, color, motion, elementary shapes, complex objects, etc. Therefore the pose O can
be seen as a constraint on the arrangement of a set of features FL. The posterior is calculated based on
the observation (Murphy) that the features are independent given the pose. This insight allows the
posterior to be factored exactly as:

where 1:t represents all data obtained from time 1 to time t (F1:t = F1, F2, ..., Ft), while i = 1, ..., n
indexes the particular feature in the model. For example, a model of a triangle that consists of 3 corners would
have a pose O (e.g., location of center of mass, rotation angle, scale factor) and a set of 3 features FL,t,
one for each corner in relation to O. The independence assumption makes it possible to estimate the pose O
as well as the probability of each feature FL, where:


$$p(F_{L,t} \mid O_{1:t}, F_{L,1:t}) = \eta\, p(F_{L,t} \mid O_t, F_{L,t})\, p(F_{L,t} \mid O_{1:t-1}, F_{L,1:t-1})$$

Here η is a constant which corresponds to the normalizing factor (derived from Bayes’ equation).
p(FL,t | O1:t, FL,1:t) is then used as feedback to supply a prior to the lower-level modules. This prior
is then added to the likelihood that the module maintains, p(FL,t | Ot, FL,t). Therefore our specific
module of interest would obtain its prior from higher module(s) H as:


$$p(F_{L,t} \mid O_t, F_{L,t}) \leftarrow p(F_{L,t} \mid O_t, F_{L,t}) + p(F_{H,t} \mid O_{H,1:t}, F_{H,1:t})$$
where OH and FH are the prior pose and feature coming from a higher-level module. We use a
nonlinear saturation to cap p(FL,t | Ot, FL,t) at 1 if its value is greater than 1. This can be interpreted as
taking either the likelihood that the module maintains or the prior from a higher level to form the
likelihood for the feature existing in a particular pose.
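
The difference between offsetting and scaling can be made concrete with a small Python sketch. The function names and the per-location list representation are hypothetical and chosen only for illustration. The additive form with saturation follows the description above, and the multiplicative form is shown purely for contrast.

```python
def add_prior_with_saturation(likelihood, prior_from_above):
    """Offset a bottom-up likelihood by a top-down prior, capping each value at 1."""
    return [min(1.0, l + p) for l, p in zip(likelihood, prior_from_above)]


def scale_by_prior(likelihood, prior_from_above):
    """Conventional multiplicative use of a prior, for comparison only."""
    return [l * p for l, p in zip(likelihood, prior_from_above)]
```

With a likelihood of `[0.0, 0.4, 0.9]` and a prior of `[0.6, 0.3, 0.5]`, the additive form gives a positive value at the first location even though the data there is zero, and the last location saturates at 1. The multiplicative form leaves the first location at zero, so no edge could be hallucinated there.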

To ensure that the hallucinations do not produce too much false data, the module evaluates how
surprising the addition of the data has been, as proposed in (Itti and Baldi 2006). This is achieved by
taking the KL divergence between the posterior p(Ft | FL,t) and the full prior p(Ft):


$$\mathrm{Surprise} = \mathrm{KL}\big(p(F_t \mid F_{L,t}),\; p(F_t)\big)$$

If surprise surpasses a particular threshold θ, the likelihood p(FL,t | Ot, FL,t) is reset to a uniform
distribution. Future work will study how to only correct the wrong hypothesis or learn a better model
for the module to decrease surprise in the future. Note that θ is the only parameter that needs to be
tuned for each module. This parameter determines how quickly each of the modules adapts to changes
and  learns  to  better  predict  the  world.  For  example,  if  a  feature  has  moved  from  one  position  to  another, 
a  large  surprise  value  will  occur.  However,  if  another  module  could  have  predicted  the  movement  and 
shifted  the  probability  distribution  accordingly,  then  the  system  is  not  surprised.  Therefore,  learning 
such  a  predictive  model  would  help  explain  the  world  with  better  hypotheses. 
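
The surprise check can be sketched for discrete distributions as follows. This is a minimal illustration, assuming distributions stored as lists that sum to 1. The function names, the small `eps` guard against division by zero, and the example threshold are assumptions and not part of the described system.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p, q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def apply_surprise_check(posterior, prior, likelihood, threshold):
    """Reset the likelihood to uniform when surprise exceeds the threshold."""
    surprise = kl_divergence(posterior, prior)
    if surprise > threshold:
        n = len(likelihood)
        return [1.0 / n] * n, surprise
    return likelihood, surprise
```

When the posterior matches the prior, the surprise is zero and the likelihood is kept. When a feature jumps to a location the prior considered very unlikely, the surprise is large and the hallucinated likelihood is discarded.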

Particle filtering (Montemerlo et al. 2002) is then used to efficiently compute and store the probability
distributions within the modules, which allows the system to support multiple hypotheses. However, in
some particular modules where the probability distributions can be estimated from a single Gaussian,
the simpler Extended Kalman filter is employed (Maybeck 1979). Here we describe only the particle filter
approach due to length constraints. Each particle Pt[m] (where m is one out of M particles) has an
estimated pose Ot[m] and a set of estimated features p(Fi[m]) (refer to (Montemerlo 2002, 2003) for
more details on the process):

$$P_t^{[m]} = \left[\, O_t^{[m]},\; p(F_1^{[m]}),\, \ldots,\, p(F_n^{[m]}) \,\right]$$


In each iteration a new pose Ot for each particle is estimated using the features from previous layers as
well as the priors. Bottom-up computations help guide the system toward probable “right” hypotheses.
The inspiration comes from (Zhu et al. 2000) and (Montemerlo et al. 2003), and the technique is known in the
literature as proposal distributions. Therefore, the pose O in the module is sampled based on prior data
along with the current feature observations FL from lower-level modules. This bottom-up approach is
based on two types of computations: saliency maps (Itti et al. 1998, Itti and Koch 2000) and the
Generalized Hough Transform (Ballard 1987). The saliency maps are used to pick relevant
features/locations for a particular module, which are then fed into a Generalized Hough Transform
computation to estimate the pose as defined earlier. The saliency computations provide an efficient
manner of disregarding information in the presence of clutter and noise, since less data needs to be
processed. It is important to note that the full probability distribution is fed into the next layer and not
just its maximum or another single value summarizing the data. Therefore, the bottom-up probabilities
only guide the module to the correct pose/feature, but do not make a decision, in contrast to other
bottom-up approaches (Serre et al. 2007). The reason for this is that a higher level in the hierarchy
might find that less probable data received in the lower level is more relevant for the current
perception. The proposal distribution can be reformulated as follows:


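where η is a normalizing constant for particle m, p(Ot | Ot-1) is the prior distribution, and
the remaining factor is the product over the features i = 1, ..., n of p(FL,t | Ot, O1:t-1, FL,1:t-1).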


To evaluate p(FL,t | Ot, O1:t-1, FL,1:t-1) we first use saliency computations to filter out irrelevant data. The
saliency computation differs based on features. For example, the full saliency map of (Itti and Koch
2000, Itti et al. 1998) can be used to filter out the background and only concentrate on objects, while
the max-normalization of (Itti et al. 1998) can be used to filter out edges that correspond to textures
(repeating patterns), and only let extended contour edges pass through. This level can be seen as an
attention level, which can also be biased based on a task as proposed in (Wolfe 1994, Navalpakkam
and Itti 2007) to further reduce the amount of data. Once this step is completed, the Hough transform is
used to estimate the pose based on a model. This enables an efficient running time of O(NM), where N
is the number of features that passed through the saliency maps and M is the number of pose parameters
to estimate. To sample from the Hough map, each module first converts the map into a probability
distribution by normalizing the map, then, using the concept of inhibition of return (Itti et al. 1998),
picks local maxima for the sample pose.
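
The sampling step from the Hough map can be sketched in one dimension as follows. This is a simplified illustration, assuming the accumulator is a list of non-negative votes over candidate poses. The function name, the inhibition radius, and the hard suppression of the neighborhood are assumptions made for the example.

```python
def pick_poses(hough_map, n_picks, inhibit_radius=1):
    """Normalize a 1D Hough accumulator and pick up to n_picks maxima,
    suppressing the neighborhood of each pick (inhibition of return)."""
    total = sum(hough_map)
    if total <= 0:
        return []
    prob = [v / total for v in hough_map]      # accumulator as a probability distribution
    work = list(prob)                          # working copy that gets inhibited
    picks = []
    for _ in range(n_picks):
        best = max(range(len(work)), key=lambda i: work[i])
        if work[best] <= 0.0:                  # nothing left to attend to
            break
        picks.append((best, prob[best]))       # candidate pose and its probability
        lo = max(0, best - inhibit_radius)
        hi = min(len(work), best + inhibit_radius + 1)
        for i in range(lo, hi):
            work[i] = 0.0                      # inhibit the attended neighborhood
    return picks
```

Each pick keeps its probability, so the next layer still receives a weighted set of hypotheses instead of a single winner.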

The  proposal  distribution  does  not  yet  correspond  to  the  posterior  so  we  must  resample  to  evaluate  the 
hypotheses  of  the  particles.  To  account  for  multimodal  distributions,  we  use  stratified  sampling.  The 
importance weights are calculated as follows:


i=  1 

where p(FL,t | O1:t-1, FL,1:t-1) is a known model which maps between feature positions and poses. Note
that  currently  we  assume  that  models  of  the  features  are  given  for  particular  objects,  and  the  system 
does  not  learn  models  but  just  evaluates  them.  Therefore,  the  feature  positions  within  a  model  are  never 
updated  since  they  are  known  precisely.  Future  versions  of  the  framework  will  incorporate  learning  for 
new  objects,  which  will  include  updating  the  feature  positions  and  their  model. 
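
Stratified resampling, mentioned above as the way multimodal distributions are preserved, can be sketched as follows. This is a generic textbook version and is not taken from the implemented system. The function name and the list-of-weights interface are assumptions.

```python
import random


def stratified_resample(weights, seed=0):
    """Return indices of the particles kept after stratified resampling.

    The unit interval is split into M equal strata and one uniform draw is
    taken inside each, which spreads the samples across all modes.
    """
    rng = random.Random(seed)
    m = len(weights)
    total = sum(weights)
    cumulative = []
    acc = 0.0
    for w in weights:
        acc += w / total
        cumulative.append(acc)
    cumulative[-1] = 1.0                       # guard against rounding error
    indices = []
    j = 0
    for k in range(m):
        u = (k + rng.random()) / m             # one draw in stratum [k/M, (k+1)/M)
        while u > cumulative[j]:
            j += 1
        indices.append(j)
    return indices
```

For weights `[0.1, 0.1, 0.7, 0.1]`, the heavily weighted third particle is duplicated at least twice, while low-weight particles can still survive when a draw lands in their interval.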




Implementation  of  the  framework  for  visual  perception 

(Version  1.0) 


Figure  7:  The  implementation  of  2D  perception.  See  text  for  details. 


An  implementation  of  the  framework  was  developed  for  visual  perception.  The  modules  are 
inspired by the visual structures proposed by (Marr 1982; Hummel and Biederman 1992) and
various  visual  cortex  regions  and  their  properties.  The  world  for  this  model  consists  of  shapes  in  a  2D 
world  subjected  to  translations  and  rotations  as  well  as  cluttered  backgrounds.  Figure  7  shows  the 
modules  and  their  feedback. 

The ganglion cells module is responsible for p(Wi | I), the belief over
luminance at a given location Wi in the world given image data I. This image data comes from the
luminance values of a camera. Since this probability distribution can be modeled by a Gaussian, this
module uses the Kalman filter equations for the update, where the prior comes from the V1 cells layer.
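
For a single pixel, the Gaussian belief update reduces to the scalar Kalman filter equations. The sketch below shows that standard update. The function name, the variable names, and the example variances are illustrative assumptions, not values from the implemented module.

```python
def kalman_update(mean, var, measurement, meas_var):
    """Scalar Kalman update of a Gaussian luminance belief.

    mean, var     : current belief (for example, the prior supplied from V1)
    measurement   : luminance value read from the camera
    meas_var      : assumed sensor noise variance
    """
    gain = var / (var + meas_var)              # Kalman gain
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var
```

With a belief of mean 100 and variance 4 and a measurement of 110 with the same variance, the updated belief has mean 105 and variance 2, halfway between the prior and the data.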

The next layer in the system computes the orientation and location of edges and is named V1.
This is achieved by using the Sobel operator (Feldman 1968) as a saliency operator to give evidence for
edges. Since the Sobel operator can estimate the pose from the filter directly, the Hough transform is
skipped for this level. To model edges, Gabor filters are used. Given the 2D nature of the input, only
the position and orientation are used for the pose. Since a given location can respond to multiple edges
(e.g., corners), the particle filter approach is used. This module is an example of how efficient the
system is at computing edges, since computing this space with just Gabor filters can be very
time-consuming in a serial process. The prior for this module comes from V2, which helps guide
and improve the hypothesis of edges in this layer. This is also biologically plausible, since the tuning
properties in primate V1 cortex just following onset of a stimulus (~40 ms) correspond to edge
orientations, spatial frequencies and colors, while at a later time (~100 ms) they are more sensitive to
edges that correspond to global properties in the scene (Lamme and Roelfsema 2000).
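
The way the Sobel operator yields both edge evidence and an orientation estimate at a pixel can be sketched as follows. This is the standard 3x3 operator applied to an image stored as a list of rows. The function name is an assumption, and border handling is left out for brevity.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]     # horizontal gradient kernel
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]     # vertical gradient kernel


def sobel_at(image, row, col):
    """Return (magnitude, gradient angle in radians) at an interior pixel."""
    gx = gy = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            pixel = image[row + dr][col + dc]
            gx += SOBEL_X[dr + 1][dc + 1] * pixel
            gy += SOBEL_Y[dr + 1][dc + 1] * pixel
    # The gradient points across the edge; the edge itself runs perpendicular to it.
    return math.hypot(gx, gy), math.atan2(gy, gx)
```

The magnitude can serve as edge evidence at that location, and the angle gives the orientation part of the pose without a Hough transform.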




V1 then provides input into the V2 module, which computes various Gestalt features and
non-accidental properties (Biederman 2000) as its bottom-up computations. This stems from responses of
many V2 neurons in visual cortex corresponding to illusory contours, and figure-ground segregations
that can be achieved with Gestalt laws (Qiu and Heydt 2005). This module then computes the
probability that a local edge belongs to an extended contour. This filters out edges from textures or
background. In our particular implementation of the model, the work by (Grigorescu et al. 2003) is
used to improve the probability of edges belonging to a contour based on their neighbors. Future
implementations will include a suite of Gestalt laws. The prior for this module comes from V4d and V4,
which provide evidence for the contours.

The  V2  module  provides  information  to  V4d  and  V4,  which  compute  simple  geometric  shapes  like 
squares, triangles, circles, etc., that we define as Geons. These shapes are inspired by the Geon
hypothesis  that  complex  objects  are  broken  down  into  simpler  geometric  shapes  (Biederman  1987). 

V4d  is  responsible  for  computing  the  vertices  of  shapes  (in  the  case  of  a  circle,  small  arcs  are 
computed),  while  V4  computes  the  outline  of  the  contour.  This  type  of  tuning  response  has  been  found 
to  be  computed  by  the  neurons  of  primate  V4  cortex  (Pasupathy  and  Connor  1999  and  Gustavsen  and 
Gallant  2003).  The  max-normalization  method  proposed  in  (Itti  1998)  is  used  for  saliency  computations 
while  the  generalized  Hough  transform  is  used  to  find  the  pose  for  basic  shapes:  square,  triangle  and 
circle.  The  probability  that  a  shape  exists  given  a  pose  is  evaluated  by  assuming  that  each  vertex  is 
located  within  a  Gaussian  PDF  and  that  the  edges  are  along  a  straight  line  from  each  vertex.  The  prior 
for  this  module  comes  from  the  IT  level,  which  provides  the  probability  that  various  Geons  exists. 
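As an illustration of the vertex term described above, the following minimal sketch scores a proposed pose of a square by placing an isotropic Gaussian around each predicted vertex. It is not the implemented V4 module: the function names, the value of sigma, the independence of the vertices and the omission of the straight-edge term are all assumptions made for the example.

```python
import math


def vertex_log_likelihood(observed, predicted, sigma):
    """Log density of an isotropic 2D Gaussian centered on the predicted vertex."""
    dx = observed[0] - predicted[0]
    dy = observed[1] - predicted[1]
    return -(dx * dx + dy * dy) / (2.0 * sigma ** 2) - math.log(2.0 * math.pi * sigma ** 2)


def shape_log_likelihood(observed_vertices, predicted_vertices, sigma=2.0):
    """Sum of per-vertex terms; treating vertices as independent is an assumption."""
    return sum(
        vertex_log_likelihood(o, p, sigma)
        for o, p in zip(observed_vertices, predicted_vertices)
    )


def square_vertices(cx, cy, side, theta):
    """Vertices of a square for a hypothetical pose (center, side length, rotation)."""
    h = side / 2.0
    c, s = math.cos(theta), math.sin(theta)
    corners = [(-h, -h), (h, -h), (h, h), (-h, h)]
    return [(cx + c * x - s * y, cy + s * x + c * y) for x, y in corners]


# Vertices "detected" in the image (here generated from a known pose).
observed = square_vertices(10.0, 10.0, 4.0, 0.0)
correct_pose = shape_log_likelihood(observed, square_vertices(10.0, 10.0, 4.0, 0.0))
shifted_pose = shape_log_likelihood(observed, square_vertices(13.0, 10.0, 4.0, 0.0))
```

In this toy case the correct pose scores higher than the shifted one, which is the behavior needed to rank the pose proposals coming from the Hough transform.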

The last module in the system, IT, computes the probability of objects having particular poses. It
receives its input from the V4 module, and no saliency computations are used. Instead, the V4 Geons
are fed directly into a generalized Hough transform map from which the proposal distributions are
evaluated. We have found in the framework that the higher the module is in the hierarchy, the fewer
saliency computations are needed, since the amount of data that passes through is minimal. The
probability of an object given a pose is evaluated by assuming that each V4 Geon is located within
some Gaussian amount of uncertainty in both position and orientation.

Results  for  system  version  1.0 


Figure 8: The objects used in testing. From left to right: House, Woman, Hat, Man, Man with Hat.

The implemented system was tested by using an overhead camera focused on a scene containing simple
objects. The objects were composed of simple cardboard cutouts of shapes (squares, triangles and
circles), which were arranged into familiar shapes (Figure 8). Five objects were used in the tests:
House, Man, Woman, Hat (defined as a triangle on top of a circle) and Man with Hat. These
objects were then subjected to various translations and rotations, along with complex background
changes. To compare against the state of the art, we used the SIFT method proposed by (Lowe 2004). SIFT
can be seen as an implementation of this framework by observing that the keypoint selections using
scale-space extrema are analogous to our saliency computations, the evaluation of keypoints against a
database is a non-parametric mapping function for evaluating hypotheses (the keypoint descriptors in
this case), and the generalized Hough transform is the final output of the system (an additional layer
with no validation). SIFT was trained on 15 views of each object (3 shots at 5 different poses). SIFT
achieved a 100% recognition rate on that training set, confirming correct operation of the SIFT
implementation. The results for a test set of 621 images are in Table 1.


Figure  9:  Example  scenes  tested  in  the  model  with  clutter  (hand,  shadows). 


Figure 10: Input and output maps for one scene in the implemented framework. Top left: Ganglion input
(camera input) at left and cells output at right; middle left: V1 in/out; bottom left: V2 in/out; top
right: V4d in + corners at left and particles at right; middle right: V4 in + Geons and particles at
right; bottom right: IT with object overlay at left and particles at right.

Method             No clutter (n=342)    Clutter (n=279)    Total (n=621)
SIFT               41.81%                33.33%             38.00%
Proposed System    100%                  97.13%             98.71%

Table 1: Results for SIFT and the proposed system.
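The Total column of Table 1 can be checked against the per-condition rates and the image counts. The short sketch below does this arithmetic; the dictionary layout and the function name are ours and are only meant as a consistency check on the reported numbers.

```python
# (recognition rate in %, number of test images) per condition, taken from Table 1
results = {
    "SIFT": {"no_clutter": (41.81, 342), "clutter": (33.33, 279)},
    "Proposed System": {"no_clutter": (100.0, 342), "clutter": (97.13, 279)},
}


def pooled_rate(conditions):
    """Recognition rate over all images, weighting each condition by its image count."""
    total_images = sum(n for _, n in conditions.values())
    total_correct = sum(rate / 100.0 * n for rate, n in conditions.values())
    return 100.0 * total_correct / total_images


for method, conditions in results.items():
    print(f"{method}: {pooled_rate(conditions):.2f}%")
```

The pooled rates agree with the Total column to within rounding.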

No temporal filtering was used during testing. This limited the system's ability to predict where an
object might move and to form better hypotheses (as in Isard and Blake 1998), but it allowed us to
determine system robustness on isolated static scenes. Adding temporal filtering should only improve
recognition. Figure 10 shows an example scene with all the modules in the system, while Figure 9
shows just the IT output for some scenes. As can be seen, the system correctly identifies the object.
Examining the ganglion cell layer, most of the noise in the image has been eliminated. Looking at the
V2 layer, most of the edges on the shadows have disappeared and only the outline of the contours
remains. V4d shows the hypothesized corners and their positions without the feedback. Even though
some of the corners have not been hypothesized correctly, the system is still able to correctly identify
the objects in the scene. V4 shows how the correct Geons are hypothesized along with the particle
positions, while IT shows the correctly identified objects.

Figure 11: The responses of V1, V2 and V4 to a two-house scene with a complex background. From
left to right: input image, V1 response, V2 response with V4 response overlaid on top.

Figure 11 shows a selected complex background with the responses of V1, V2, and V4. As can be seen,
there are initially many false edges that the system needs to consider. Just using a standard Hough
transform would result in many false positives. However, V2 already cuts down on a lot of the edges,
while the verification step in V4 is able to narrow down the correct hypothesis.

Implementation of the framework for visual perception

(Version 2.0)


Figure  12:  System  version  2.0  modules.  See  text  for  details. 


The current framework has been expanded and improved with a few more modules. Figure 12 shows
the blocks currently implemented. The V2 module now implements the tensor voting framework
proposed by G. Medioni (2004). This boosts the edges and makes them more salient if they are
continuous. The Contours module implements a simple edge-following algorithm which prefers edges
with the same orientation, magnitude and distance. The contours are then approximated with lines.
Lastly, the 2.5D Sketch module provides 2D shape recognition under similarity transformations. This is
done with a generalized Hough transform to propose shapes. The shapes are then evaluated using
directional chamfer matching for efficiency.

Figure 13: The output of V1 for various locations in the image with a threshold of 0 on the
edge detection (i.e., any edge, however small its magnitude, is marked as an edge). The left image
shows the full image with the attended 320x240 image. The middle image shows the edge detection, while
the right image shows the proposed shapes along with their corresponding probabilities.


We also explored a method for tuning the various parameters in the system using the notion of
bottom-up and top-down processes. This can be seen as biasing various detectors to give different
results. One classic example is choosing a threshold for edge detectors (Figure 13). The system was
developed to explore the parameter space using the current framework. This was done by setting the
initial threshold value to a high number, which reduced the amount of data entering the system. For
example, setting a high edge threshold in V1 produced fewer edges available to V2. The system then
proposes a shape, evaluates a probability for it, and sends a bias signal to reduce the threshold
values in regions where there are no edges. This results in one of two outcomes: either the shape is
found, in which case the probability will increase due to the additional correct edges, or the shape
is not found, in which case the probability will decrease or stay the same (we determine that a shape
is found if more than 75% of the edges contribute to the match). The process continues as long as the
probability is increasing, and stops otherwise. Even though this is an iterative process, only about 3
iterations are required on average to extract the shape.
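The control loop of this procedure can be sketched as follows. This is a simplified illustration, not the implemented system: it uses a single global threshold instead of region-specific bias signals, and the function names, the step size, the starting threshold and the toy edge magnitudes are all assumptions. Only the 75% criterion and the stopping rule (stop when the probability no longer increases) come from the description above.

```python
def tune_threshold(match_probability, threshold=0.9, step=0.2,
                   found_at=0.75, max_iters=10):
    """Lower the edge threshold while the match probability keeps increasing.

    match_probability(threshold) is a hypothetical callback returning the
    fraction of the proposed shape's edges supported at that threshold.
    Returns (found, history), where history lists (threshold, probability).
    """
    prob = match_probability(threshold)
    history = [(threshold, prob)]
    for _ in range(max_iters):
        if prob > found_at:
            break  # shape found: more than 75% of the edges contribute
        candidate = max(0.0, threshold - step)
        new_prob = match_probability(candidate)
        if new_prob <= prob:
            break  # probability stopped increasing: give up on this proposal
        threshold, prob = candidate, new_prob
        history.append((threshold, prob))
    return prob > found_at, history


def edge_fraction(model_edge_magnitudes):
    """Toy stand-in for shape evaluation: fraction of model edges above threshold."""
    def match_probability(threshold):
        kept = [m for m in model_edge_magnitudes if m >= threshold]
        return len(kept) / len(model_edge_magnitudes)
    return match_probability


# Toy target: weak but real edges appear as the threshold is lowered.
target = edge_fraction([0.95, 0.82, 0.74, 0.61, 0.57, 0.48, 0.43, 0.33])
# Toy non-target: most of the expected edges are simply absent.
non_target = edge_fraction([0.95, 0.82, 0.05, 0.04, 0.03, 0.02, 0.01, 0.0])

found_target, target_history = tune_threshold(target)
found_non_target, non_target_history = tune_threshold(non_target)
```

With these toy values the target is accepted after the threshold has been lowered three times, while the non-target proposal is abandoned as soon as lowering the threshold stops adding supporting edges.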


Lastly, we improved the efficiency of the matching process using Fourier Descriptors as top-down
features. Fourier Descriptors have been used in the past for shape analysis (Charles et al. 1972,
Otterloo 1991), character recognition (Persoon and Fu 1977, Rauber 1994), shape coding (Chellappa
and Bagdazian 1984), shape classification (Kauppinen et al. 1995) and shape retrieval (Guojun and
Sajjanhar 1999, Sajjanhar 1997, Huang and Huang 1998). Fourier Descriptors are a method of
extracting a shape signature in the form of a vector. These signatures are often invariant to scale,
rotation and translation, which makes them very efficient for matching shapes.

Figure 14: Fourier Descriptors formulation.


Figure 14 shows how the coordinates of a boundary are collected into a complex vector (1st step). The
2nd step computes the Fourier descriptors by essentially finding the spatial frequencies that fit the
boundary points. The first Fourier component (the d.c. component) is simply the mean value of the x
and y coordinates (i.e., the center of the boundary). The second component gives the radius of the
circle which best fits the points. Lastly, higher-order components give more details on the contour,
with finer details occupying the higher frequencies (Figure 15).
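The two steps can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions rather than the implementation used in the system: it uses a direct discrete Fourier transform, assumes the boundary is sampled counter-clockwise, and builds a signature from coefficient magnitudes normalized by the second component (the function names and the number of components are hypothetical choices).

```python
import cmath
import math


def fourier_descriptors(boundary):
    """Step 1: pack (x, y) points into complex numbers. Step 2: take their DFT."""
    z = [complex(x, y) for x, y in boundary]
    n = len(z)
    return [
        sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
        for k in range(n)
    ]


def invariant_signature(coeffs, n_components=8):
    """Drop the d.c. term (translation), keep magnitudes (rotation) and
    divide by the second component's magnitude (scale)."""
    magnitudes = [abs(c) for c in coeffs[1:n_components + 1]]
    return [m / magnitudes[0] for m in magnitudes]


# A circle of radius 2 centered at (3, 4), sampled counter-clockwise.
angles = [2.0 * math.pi * i / 64 for i in range(64)]
circle = [(3.0 + 2.0 * math.cos(a), 4.0 + 2.0 * math.sin(a)) for a in angles]
circle_fd = fourier_descriptors(circle)
center = circle_fd[0]        # d.c. component: the center of the boundary
radius = abs(circle_fd[1])   # second component: radius of the fitted circle
```

For the sampled circle, the d.c. component recovers the center (3, 4) and the magnitude of the second component recovers the radius 2, matching the interpretation of the components given above.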

Figure 15: Fourier descriptor components for an airplane contour. Top row, from left: 1st, 2nd and 3rd
components. Bottom row: 5th, 6th and 20th components.

Results for system version 2.0


The first simple test consisted of a toy world filmed with a camera using a robotic arm. In the scene, a
toy airplane was the target, as seen in Figures 13, 16 and 17. Figure 13 shows an example where the
threshold was set to 0. As can be seen, there are many edges to process, and the system took 15.62
seconds on average to extract and evaluate the shape proposals. On the other hand, Figures 16 and 17
show the iterative process for extracting the shape for the target object and a non-target object. Each
iteration took 1.17 seconds, so with about three iterations the shape was extracted within about 3.5
seconds.


Figure 16: The iterative parameter tuning for edge selection. Three iterations are shown from top to
bottom. The green box on the left indicates the attention location, while the red boxes indicate the
biasing location. As can be seen, at each iteration the proposed shape improves the probability of the
shape, which causes more edges to be biased. At the end of the third iteration, the shape has a
probability of more than 75%, which causes the system to stop and mark the shape.

Figure 17: The iterative parameter tuning for edge selection. Three iterations are shown from top to
bottom. The green box on the left indicates the attention location, while the red boxes indicate the
biasing location. As can be seen, at each iteration the proposed shape improves the probability of the
shape, which causes more edges to be biased. However, the shape never fully matches the target, which
causes subsequent iterations to not increase the probability, terminating the process.

Figure 18: Example images from the ETHZ dataset.


The next set of tests was performed on the apple logos of the ETHZ dataset (Figure 18) so that the
system could be compared to state-of-the-art systems. The dataset contains 255 test images depicting
five diverse shape-based classes (apple logos, bottles, giraffes, mugs, and swans) in various scenes.


Figure 19 shows an example image with the output of the system before and after the dynamic biasing.
As can be seen, there are many false positives before the dynamic biasing, and only one after. This
allows the system to lower the threshold on the matching and find more images with fewer false
positives. Figure 20 shows the system being run with a 30% threshold on the edges, a 10% threshold,
and the dynamic biasing. As can be seen, a 10% threshold has a lower detection rate than the 30%
threshold. This is due to the fact that at the 10% threshold there are fewer edges. However, at the
30% threshold we get more detections but also more false positives. Using the dynamic biasing, the
system is able to take the best of both worlds: it uses the 30% threshold to lower the false positive
rate, and only lowers the threshold on probable candidates. Lastly, Figure 21 shows the results of the
system in comparison with other methods. As can be seen, the dynamic biasing achieves a greater
detection rate with fewer false positives.

Figure 19: Example system output for a given image (top). The middle row shows the edges, input
and output before dynamic biasing. The bottom row shows the system after the dynamic biasing.


Figure 20: Running the system with fixed thresholds and with the dynamic biasing (plot of detection
rate, static vs dynamic biasing).

Figure 21: Results on apple logos compared to other algorithms (plots of detection rate). The bottom
image is a zoomed version of the top image.


Conclusions 


In this work, we propose a biologically-inspired Bayesian framework to solve the perception problem.
Importantly, the framework is helpful in designing and analyzing algorithms for perception. For
example, the structure of the modules proposed in this paper is based on a simplistic view of visual
cortex. There are many other regions in visual cortex that can be utilized and tested. For example,
adding an MT layer can help in forming faster and more correct hypotheses. The framework allows the
addition of such modules without having to change the system. We believe that this trait is needed in
a framework of this magnitude, since the ultimate structure for perception would need to be
determined by many researchers. This would allow individual researchers to work on specific aspects of
the problem and give them the ability to test how their work fits together with other modules (or which
modules would need to be changed). Additionally, the parameter tuning has shown a great improvement
in efficiency as well as a reduction in false positives (phantom shapes). Currently, the generalized
Hough transform takes the majority of the time, as the system needs to scan over 4 parameters (2 for
position, 1 for scale and 1 for rotation) for each shape.

In the future, we plan to extend the implemented system to full 3D and to add more layers
such as shape from shading, texture, better Gestalt models, etc. Additionally, we plan on implementing
learning  for  the  framework  so  that  we  can  handle  more  complex  structures.  For  example,  using  the 
same  methods  as  in  the  SLAM  algorithms  we  can  learn  the  probability  of  features  existing  for 
particular  objects.  Lastly,  we  plan  on  studying  how  actions  can  affect  perception  by  biasing  modules  in 
the  system  to  achieve  perception  that  is  relative  to  the  current  task.  For  example,  playing  video  games 
requires  the  understanding  of  the  scene,  but  there  are  many  irrelevant  details  that  do  not  need  to  be 
parsed  or  perceived  in  order  to  win  the  game. 

Neural basis of mental cognition

Supported  student:  Nader  Noori 


Introduction 

Symbolic problem solving has proved to be powerful in solving practical problems ranging from
counting on one's fingers to building intelligent agents. Cognition by means of abstract symbolic
concepts in an algorithmic manner is one of the tenets of mathematical cognition. Identifying the
relationship between this evolutionarily newly emerged symbolic machinery and rudimentary older modal
systems has motivated numerous studies, mostly focused on grounding the representation of symbolic
concepts. However, recent evidence emerging from neuroimaging and patient studies suggests that modal
systems for visually guiding actions in space play a role in mental operations on symbolic information
that goes beyond the representation of symbolic concepts (Koenigs et al. 2009, Knops et al. 2009).

Motivated by these findings, we have been seeking a grounded mechanistic model for algorithmic,
controlled information processing in the human brain. We studied the impact of symbolic and
presumably non-visual mental tasks on the visual system. We learned that mental tasks with
symbolic content that demand active memory manipulation impose a load on the oculomotor
system and impair visuospatial short-term memory.

To explain our findings, we propose a critical role for a spatially organized short-term memory which
is used for anchoring task-relevant items in space. These anchors are used for selective processing
of the maintained information. Selective processing of information (such as deletion of an item from
memory) is in turn made possible through shifts in spatial attention towards the registry location of
the item of interest in space.

This registry system, along with an articulatory system for hashing items into phonological codes and
a system for performing and monitoring sequential actions, provides the necessary mechanisms for
employing overly-trained networks for processing a limited set of activated items in arbitrary
algorithms. We have evaluated our hypothesis by detecting process-related traces of mental symbolic
operations in both the eye movements of human subjects and the visuospatial short-term memory of
objects in the environment.

Eye  tracking  experiments 

We studied distributions of low-amplitude gaze shifts made during a single mental sorting task in front
of a blank screen and noticed that, unlike active maintenance of a string of decimal digits, sorting
them into order induces gaze shifts that carry information about the mental stimulus for the sorting
(Figure 1). For example, we noticed that strings of digits with flipped order of items result in
symmetric distributions of gaze shifts (Figure 2). We interpreted the presence of information about
the mental executive task in the oculomotor system as the result of the active involvement of the
visuospatial system in the manipulation of memory items.

To account for the involvement of the visuospatial system in the process of memory
manipulation, we proposed the Spatial Registry Hypothesis (SRH), which assumes a functional role
for brain regions with visual-spatial encoding features in registering memory items in a
spatially-organized short-term memory. We assume that task-relevant items in working memory may
register with a corresponding visuospatial short-term memory (VSSTM). The spatial registry may
occur when selective access to memory items is required. This access might be facilitated by shifting
spatial attention in this internal registry space, which would account for the induced gaze
shifts during the mental executive tasks.

This assumption is consistent with findings of neuroimaging and neuropsychological studies that have
shown that the posterior region of the parietal lobe is critically engaged during executive memory tasks
(Osaka et al. 2007, Olson et al. 2009, Knops et al. 2009). In particular, the superior parietal lobule
(SPL), which is known for its visuospatial tuning, has been shown to be critically involved in memory
manipulation (Koenigs et al. 2009). Moreover, in a series of fMRI studies, Yantis and his colleagues
have shown that shifting of external attention between spatial locations and shifting attention in
mnemonic domains show overlap across a fronto-parietal network (Shomstein et al. 2006,
Tamber-Rosenau et al. 2011, Chiu et al. 2009).

The results of these experiments have appeared in the following publications:

1. N. Noori, L. Itti, Spatial Registry Model: Towards a Grounded Account for Executive Attention,
In: Proc. Conference on Cognitive Science (CogSci 2011), pp. 1-6, Jul 2011.

2. N. Noori, L. Itti, Eye-Movement Signatures of Abstract Mental Tasks, In: Proc. European
Conference on Cognitive Science (EuroCogSci 2011), (B. Kokinov, A. Karmiloff-Smith, N. J.
Nersessian Ed.), pp. 110:1-110:6, May 2011.

3. N. Noori, L. Itti, Visuospatial attention shifts during non-visual mental tasks, In: Proc. Vision
Science Society Annual Meeting (VSS11), May 2011.

Interference of executive memory tasks with Visuospatial Short Term Memory

Our Spatial Registry Hypothesis (SRH) predicts that visuospatial short-term memory might be
implicated in those mental tasks that require manipulation of information. Thus the hypothesis implies
that engaging in a mental executive task that requires memory manipulation not only impairs
visual perception but also affects the spatial memory of previously perceived visual items. In
particular, SRH claims that the impact of a secondary non-visual mental task is independent of
executive attentional load and selective to the regions of space that are being used for the spatial
registry.

Standard models of working memory commonly explain the impact of mental tasks on the visual system
in terms of a bottleneck in executive attentional resources that are needed for both mental
and visual tasks. However, depending on one's assumption about the involvement of executive attentional
resources in maintaining visuospatial short-term memory (VSSTM), an attentional-bottleneck account
would predict either no impact or a uniform impact of a secondary executive memory task on
VSSTM.

In these studies we investigated the possible involvement of the visuospatial short-term memory
(VSSTM) in the manipulation of information in executive working memory tasks. We showed that
performing a non-visual executive working memory task concurrently with a simple visuospatial memory
task impairs the spatial short-term memory of visual targets selectively and independently of the load
of the mental task (Figures 3 and 4).

We also verified this hypothesis by showing that the task-irrelevant spatial binding of items in a
forward-or-backward recall task has a significant effect on backward recall, which demands
memory manipulation, while it has no significant impact on forward recall, which does not demand
memory manipulation.

In our first experiment we observed that subjects demonstrate a disadvantage in retaining visuospatial
information along the vertical direction (the ignore condition of the first experiment) (Figure 3).
Therefore, if registering task-relevant items with the visuospatial short-term memory plays a functional
role in the process of memory manipulation, one could imagine that using the vertical direction for
registering the items might result in a disadvantage in bookkeeping of memory items and hence might
come at a cost to the performance of the executive memory task.

In our last experiment we investigated this prediction by priming subjects along two task-irrelevant
orientations and monitoring their performance during an executive memory task.

We measured the performance of our subjects during a forward-or-backward recall task in which
subjects first read some items from the screen without knowing whether they would have to be recalled
in forward or backward order. We measured the performance in both backward recall (executive memory
task) and forward recall (active maintenance). Forward recall presumably draws only on the
phonological loop, and thus the presentation method is not expected to have any effect on the
performance. However, backward recall requires memory manipulations and is therefore potentially
sensitive to the presumably task-irrelevant visual presentation method (Figure 5).

The results of these studies will appear in the following publication:

N.  Noori,  L.  Itti,  Selective  Impact  of  Mental  Abstract  Tasks  on  Visuospatial  Short-Term  Memory,  In: 
Proc.  Vision  Science  Society  Annual  Meeting  (VSSll),  May  2012 

Figure 1. Comparing the impact of active maintaining versus mental sorting on the direction of
gaze shifts. Panels: (a) Sorting: Horizontal - Vertical; (b) Maintaining: Horizontal - Vertical;
(c) Sorting: Random - Horizontal; (d) Sorting: Random - Vertical. Each panel shows the subtraction of
normalized distributions of directions of gaze shifts for a certain type of task and two different
initial presentation methods. (a) When the mental task is sorting, the direction of initial
presentation influences the dominant direction of gaze shifts. (b) The initial presentation has no
impact on the directions of eye movements during active maintaining. (c) During mental sorting,
horizontal presentation and random presentation of items have the same effect.

Figure 2. Sequences of items which are symmetric induce symmetric gaze
shifts during the sorting task. Each graph shows the difference between
averages of normalized amplitude distributions of gazes for two symmetric
sets of stimuli. On the top panel, the result for stimuli of type 1 - stimuli of
type 2 (canonically represented by 34012 and 21043) is shown. The bottom
panel shows the result for stimuli of type 3 - stimuli of type 4 (canonically
represented by 41230 and 03214).

Figure 3. Performing a mental sorting task impairs the
visuospatial short-term memory along the horizontal
direction while it has no effect on the visuospatial short-term
memory along the vertical direction.

Figure 4. Performing a double counting task impairs the
visuospatial short-term memory. However, this impact seems to be
independent of the rate of counting.

Figure 5. The impact of the presentation direction on
the errors of the forward and backward recall during a
forward-or-backward recall task.

Abstract

Training  has  been  shown  to  improve  perceptual  performance  on  limited  sets  of  stimuli.  However,  whether  training  can 
generally  improve  top-down  biasing  of  visual  search  in  a  target-nonspecific  manner  remains  unknown.  We  trained  subjects 
over  ten  days  on  a  visual  search  task,  challenging  them  with  a  novel  target  (top-down  goal)  on  every  trial,  while  bottom-up 
uncertainty  (distribution  of  distractors)  remained  constant.  We  analyzed  the  changes  in  saccade  statistics  and  visual 
behavior  over  the  course  of  training  by  recording  eye  movements  as  subjects  performed  the  task.  Subjects  became  experts 
at  this  task,  with  twofold  increased  performance,  decreased  fixation  duration,  and  stronger  tendency  to  guide  gaze  toward 
items  with  color  and  spatial  frequency  (but  not  necessarily  orientation)  that  resembled  the  target,  suggesting  improved 
general  top-down  biasing  of  search. 

Introduction

Bottom-up, stimulus-driven processes as well as top-down, goal-driven
processes exert influence on perception and therefore on
the ability to perform visual tasks. Experts in a wide range of fields
[1], from radiologists detecting tumors [2], image analysts
screening baggage at the airport [3], and pilots scanning their
instrument panel [4], to chess grand masters [5], rely on their
perceptual discrimination and selection abilities to make
judgements, often in life-threatening situations. Tasks performed by these
experts rely on both bottom-up and top-down processes to search
for and direct attention towards features of the image that are
crucial to enabling perceptual judgement with confidence. The
central question in this study is whether, and to what extent,
training and expertise improve, or otherwise modify, how rapid
top-down goal-driven tuning of visual processing can enhance
visual information for perceptual decisions, especially in feature-rich
environments.

Guidance of visual search for features in an image by top-down
processes poses a constant demand on the visual and attentional
systems to convert descriptions of desired target(s), which may
change from moment to moment depending on behavioral goals,
into appropriate guiding signals that can facilitate localization of a
target. The quality of the guidance is determined by a number of
factors including i) the properties of the tuning functions of the
sensory system [6], ii) the ability of the sensory system to eliminate
noise [7], and iii) the discriminability of the target from distractors
and background clutter (signal-to-noise ratio). On a short time
scale, attention can enhance guidance through enhanced gain [8],
enhanced spatial resolution [9], effective stimulus strength [10], or
noise exclusion [7]. Analogous effects have been observed in
perceptual learning studies over a longer time scale of up to a few
days or longer.

Perceptual learning studies have shown that practice can improve
performance in discrimination [11-14] and detection [15,16]. These
studies have shown improvement in either a spatially or featurally
specific manner and have thus implicated early sensory cortex as the
locus of plasticity, which has also been observed in
electrophysiological studies [17,18]. Although most studies limit their
training to either specific spatial locations or specific stimulus
feature ranges, there has been some speculation about mechanisms of more
general improvement in tasks. Some studies, for example, have implicated
higher cortex [19-21] in learning. Plasticity effects have been observed
in later visual areas, namely V4 and FEF (frontal eye fields), as a
result of perceptual learning [22,23]. Learning in tasks such as visual
search has also been shown to be less specific [24]. Sireteanu et al.
[25] have shown non-specificity of perceptual learning effects,
especially in visual search tasks, and thus placed the locus of
plasticity for learning a visual search task at a higher level than
sensory cortices. One question which has remained outstanding, however,
is whether training can improve the effectiveness of the dynamic
top-down attention biasing process itself through what has been termed
process-based learning [26], as opposed to exhibiting sharper visual
discrimination abilities for a specific type of target or location
(perceptual learning or automaticity through better memory retrieval
[26]), or generally improving speed and/or performance on a task (task
acquisition for search). This type of non-specific learning remains
understudied and, more specifically, the pairing of learning with a
visual search task to observe the effects of training top-down attention
remains relatively unexplored (although see [27]).

In this study we address the question of whether expertise can be gained
in a triple-conjunction (color, spatial frequency, and orientation)
search task when both the features and spatial location of the target
are changed from trial to trial while maintaining a persistent level of
bottom-up uncertainty in the Shannon entropy sense. This imposes a novel
constraint on the type of learning that can occur, eliminating the cases
of (perceptual) learning due to ‘stimulus imprinting’ [28] and focusing
on what Goldstone [28] has termed ‘attention weighting’. Specifically,
this type of paradigm makes a demand on the observers to make fast
trial-by-trial adjustments of top-down biasing weights in order to
succeed in the search task. We also ask what difference, if any,
training makes on the subjects’ saccadic eye movements and the types of
distractors that they look at. This is a departure from a typical
learning paradigm where the stimulus set is often restricted in either
space or feature set. We look for mechanisms of acquisition of general
domain expertise when the observers are given a task that requires
attention to the stimulus in order to achieve success. By analyzing eye
movements we can ensure that effects beyond general task acquisition are
captured. Changing the target on each trial puts the spotlight on
mechanisms of attentional biasing efficacy rather than simple perceptual
learning. We hypothesized that better biasing would lead to increased
guidance towards items that are similar to the target, as the biasing
process would render items sharing features with the target more
salient. Thus the number of items that were viewed need not necessarily
be reduced but the quality of the set may improve. An alternate outcome
would be that subjects view a smaller number of items, which would
suggest a trend toward automaticity or more pre-attentive guidance.

We show that learning occurs even when the target is changed in both
features and spatial location on every trial. The improvement is marked
by a decrease both in intersaccadic interval (ISI) and reaction time.
The decrease in ISI suggests an improvement in discrimination and a
stronger emphasis on the selection (detection) task. However, we did not
observe a significant drop in saccade counts, which suggests that the
improvement in selection was limited to improving the ‘quality’ of the
subset of items on the display that are scrutinized (the size of the
subset remaining fairly consistent). We also find that subjects tend to
exploit two of the three features of the stimuli, making saccades
towards items that are similar to the target in color and spatial
frequency but, interestingly, not necessarily in orientation.

In sum, our results provide evidence for a mechanism of expertise
acquisition that is driven by production of better top-down biasing
signals, the behavioral correlate of which is the increased similarity
effect observed. This, coupled with improved discrimination, likely
driven by multiple exposures to the family of stimuli used in the task,
defines the enabling mechanisms that allow the transition from novice to
expert.

Methods 

Ethics  Statement 

Subjects gave written consent under a protocol approved by the
Institutional Review Board of the University of Southern California, and
were paid for participating in the study.

Subjects 

Human subjects recruited for this study were undergraduate and graduate
students at the University of Southern California. Subjects included
four males and one female aged 21-26 years. All subjects had normal or
corrected vision. Subjects gave written consent under a protocol
approved by the Institutional Review Board of the University of Southern
California, and were paid for participating in the study. Subjects were
naive to the purpose of the experiment and had never seen any of the
stimuli before.

Stimuli 

A set of colored Gabor patches was designed for this experiment, which
provided the ability to vary features along three dimensions: color,
spatial frequency, and orientation. The luminance profile of each Gabor
patch is given by the following equation:

$$g(x,y,\theta,\phi) = e^{-(x^{2}+y^{2})}\, e^{2\pi\phi i\,(x\cos\theta + y\sin\theta)}$$

where $\theta$ is the orientation of the patch and $\phi$ is the spatial
frequency. Each patch subtended 4° of visual angle. The phase of the
sinusoid at each point was used to modulate the color of the pixels
along the hue axis in the HSV color space, as shown in figure 1a. By
sliding a window along the hue axis, the range of colors in the patch
was changed, thus modifying the appearance of the patch. The window
spanned from 0 to 360 and a hue shift essentially recentered the window
around a given value. Each Gabor patch was then defined by its spatial
frequency, which ranged from 1.7 c/deg to 5.2 c/deg, its orientation,
which ranged from 25° to 155°, and finally a color hue value that
determined the shift of the hue window.
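
As a rough illustration of the stimulus construction described above,
the following Python sketch builds one hue-modulated Gabor patch. It is
not the stimulus code used in the study: the function name
`gabor_hue_patch`, the pixel resolution, the envelope width `sigma_deg`,
and the choice to apply the Gaussian envelope to the HSV value channel
are all assumptions made for the example.

```python
import colorsys
import math


def gabor_hue_patch(size_px=32, extent_deg=4.0, theta_deg=45.0,
                    freq_cpd=3.0, hue_shift_deg=0.0, sigma_deg=1.0):
    """Return a size_px x size_px grid of (r, g, b) tuples in [0, 1].

    theta_deg     -- orientation of the carrier (degrees)
    freq_cpd      -- spatial frequency (cycles per degree)
    hue_shift_deg -- shift of the hue window (degrees on the hue circle)
    sigma_deg     -- width of the Gaussian envelope (assumed value)
    """
    theta = math.radians(theta_deg)
    step = extent_deg / (size_px - 1)
    patch = []
    for row in range(size_px):
        y = -extent_deg / 2 + row * step
        line = []
        for col in range(size_px):
            x = -extent_deg / 2 + col * step
            # Gaussian envelope centred on the patch.
            envelope = math.exp(-(x * x + y * y) / (2 * sigma_deg ** 2))
            # Phase of the sinusoidal carrier at this pixel.
            phase = 2 * math.pi * freq_cpd * (
                x * math.cos(theta) + y * math.sin(theta))
            # Phase selects a position on the hue circle; the hue shift
            # recentres the window.
            hue = (phase / (2 * math.pi) + hue_shift_deg / 360.0) % 1.0
            line.append(colorsys.hsv_to_rgb(hue, 1.0, envelope))
        patch.append(line)
    return patch
```

Calling `gabor_hue_patch(theta_deg=25, freq_cpd=1.7, hue_shift_deg=120)`
gives a patch at the low end of the orientation and frequency ranges
quoted above, with its hue window shifted by a third of the hue circle.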

Search arrays were constructed from 32 Gabor patches embedded in 1/f
noise in a 4x8 grid, with slight spatial jitter (1° along the x or y
direction) applied to each patch. One of the Gabor patches was randomly
chosen as the target for each search array.

Paradigm 

Subjects performed 1,000 trials of visual search over the course of ten
consecutive days. Each day consisted of a session of 100 trials with a
break after 50 trials. Stimuli were presented on a large (1920 x 1080
pixels) LCD monitor (Sony Bravia XBR-III) and subjects were seated in a
comfortable chair with their head stabilized by a chin rest. The viewing
distance was 97.8 cm, corresponding to a field of view of 54.8° x 32.7°.
A typical trial, as illustrated in figure 1b, began with a fixation
cross at the center of the display followed by a 2 second target
preview, presented at the center with a gray background. The gray value
of this background was equal to the mean gray of the 1/f noise of the
corresponding search array display. Subjects were instructed to find the
target as fast and accurately as possible and had a maximum of ten
seconds to find the target. Their eye movements were recorded as they
searched for the target (see below for eye-tracking methods). Upon
locating the target, subjects pressed a response button, at which point
the search array disappeared. A display consisting of numbers that
corresponded to the Gabor patch locations was then displayed for 200 ms.
Subjects had to read and key in the number at the location of the target
using a keyboard. The font size was sufficiently small that one could
not read the numbers corresponding to one Gabor patch while fixating at
the location of any other Gabor patch. The goal of this ‘no cheat’
procedure was to ensure that subjects correctly reported the patch which
they thought was the target (for more details on this procedure see
[29]). After subjects provided input, they were given feedback as a
‘correct’ or ‘incorrect’ response, as well as the current level of
performance (% correct responses so far). Each session lasted
approximately 45 minutes.

Figure 1. Stimulus and Paradigm. (a) Color Gabor patches constructed by first applying a Gaussian envelope over a sinusoid as shown. At each
point the phase of the sinusoid was used to modulate a hue axis in the HSV color space. (b) A trial started with a two-second target preview followed
by a display of the search array for a maximum of ten seconds. If subjects found the target before the 10 seconds elapsed they hit a key to move to the
next display. The next display showed numbers corresponding to Gabor patch locations in the search display. The numbers were displayed for only
200 ms to ensure that subjects fixated the target in order to report the correct number. Subjects were then asked to report the number at the target
location. (c) A typical eye trace overlaid on a search array, showing an early trial. (d) A typical eye trace overlaid on a search array, showing a late trial.

Stimulus Presentation and Eye-Tracking Procedures

The subjects’ eye movements were recorded as they searched for the
target in the search array. Eye movements were recorded at a sampling
frequency of 240 Hz using an infrared-video-based eye-tracker (ISCAN
RK-464); the pupil and corneal reflection from the right eye were used
to determine the gaze position with an accuracy of <1°. Calibration was
performed using an online system that presented subjects with a central
fixation point followed by a point at one of nine locations on a 3x3
grid. Subjects had to saccade from the central fixation point to one of
the nine locations and maintain stable fixation (x and y position
variance <5 pixels) for 300 ms (75 samples). Once stable fixation
was  established  the  next  location  was  presented.  This  process  was 
repeated  until  stable  fixations  at  all  nine  points  were  found.  The 
eye  positions  obtained  were  then  used  to  perform  an  affine 
transform  and  the  transformed  eye  positions  were  displayed  on  the 
screen  for  the  experimenter  to  confirm  that  an  accurate  calibration 
session  had  been  conducted.  During  offline  analysis  a  further  thin- 
plate-spline  interpolation  [30]  was  performed  to  obtain  accurate 
transformation  from  eye-tracker  coordinates  to  screen  coordinates. 

A  recalibration  session  was  performed  every  20  trials  to  correct  for 
possible  head  movements.  Once  transformed,  the  eye-traces  could 
be  overlaid  on  the  images  for  further  analysis  as  shown  in 
figure  1(d). 

Data Analysis

The subjects’ eye movements were calibrated as described above and an
algorithm was used to parse the eye movements into saccades using a
combination of filtered instantaneous velocity measurements and a simple
windowed Principal Components Analysis (PCA). Eye movement segments with
a minimum velocity of 30°/s and a minimum amplitude of T js were
classified as saccades. Blinks were identified by a pupil diameter
reading of zero, and trials with either blinks or loss of tracking for
more than 10% of the trial were removed from further analysis.
Unfortunately, on day two, one subject’s eye movement data were lost due
to machine failure; however, he completed all trials and continued to
participate in the study. This loss notwithstanding, we retained 97% of
the 4,900 available trials, obtaining a total of 76,287 saccades for
analysis.

We analyzed changes over time in the subjects’ eye movements by
constructing feature similarity maps and correlating these with binary
saccade maps. The feature similarity maps were constructed as follows.
We first discretized the feature space by dividing each dimension into
ten bins (several numbers were tried for this and numbers between 10 and
25 bins gave similar results). Each Gabor patch was then defined as a
triplet of bin values $G_i = \{h_i, f_i, o_i\}$ where $h_i$, $f_i$,
$o_i$ are the bins of hue, frequency, and orientation respectively of
Gabor patch $i$. A feature similarity map for each trial consists of 32
cells arranged in a 4x8 grid, each corresponding to one of the color
Gabor patches in the search array for that trial. Similarity maps for
each feature were constructed individually. A feature similarity map for
hue, for example, would contain in each cell $i$ the difference between
the hue bin value $h_i$ of the Gabor patch and the hue bin $h_t$ of the
target Gabor patch $t$ for the trial. In order to maintain an intuitive
sense of the similarity measure (high values for high similarity) we
computed similarity between the target patch $t$ and a Gabor patch $i$
for each feature $f$ as $s_{if} = -|f_i - f_t| + \text{granularity}$
(where granularity was set to ten since we divided the feature space
into ten bins). Large values in cells therefore mean that the particular
distractor was very similar to the target, and vice versa.
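
The similarity measure described above can be sketched in a few lines of
Python. This is a minimal illustration rather than the analysis code
used in the study; the helper names `to_bin` and `similarity_map`, and
the equal-width binning over a known feature range, are assumptions.

```python
def to_bin(value, lo, hi, n_bins=10):
    """Map a feature value in [lo, hi] to an integer bin 0..n_bins-1.

    Equal-width binning is an assumption; the text only states that
    each feature dimension was divided into ten bins.
    """
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def similarity_map(feature_bins, target_index, granularity=10):
    """Build the similarity map for one feature on one trial.

    feature_bins -- 4x8 nested list of bin indices for that feature
    target_index -- row-major index (0..31) of the target patch
    Each cell holds granularity - |bin_i - bin_target|, so larger
    values mean the patch is more similar to the target.
    """
    flat = [b for row in feature_bins for b in row]
    target_bin = flat[target_index]
    return [[granularity - abs(b - target_bin) for b in row]
            for row in feature_bins]
```

With ten bins the target cell always receives the maximum value of 10,
and the least similar possible distractor (nine bins away) receives 1.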

As described before, we drew the features of the distractors in each
display from a uniform distribution and therefore, by design, the
bottom-up uncertainty in each display averaged across sessions should
remain constant. In order to ensure that this was the case we computed
the Shannon entropy of each feature similarity map. This enabled us to
quantify the amount of uncertainty in our arrays. We then computed the
average entropy per session and ran a regression to look for any trends
over time. As expected, we found no significant trends (color r² = 0.03,
p = 0.63; frequency r² = 0.12, p = 0.33; orientation r² = 0.22,
p = 0.17).
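
One way to compute such an entropy is sketched below. The text does not
specify how the probability distribution was estimated, so this example
assumes it is the empirical distribution of the similarity values across
the 32 cells of a map; the function name `shannon_entropy` is likewise
hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(sim_map):
    """Shannon entropy (in bits) of the values in a similarity map.

    Assumption: probabilities are the relative frequencies of each
    distinct similarity value among the cells of the map.
    """
    values = [v for row in sim_map for v in row]
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A map whose cells all hold the same value has zero entropy, while a map
split evenly between two values has an entropy of one bit; averaging the
per-trial entropies within each session gives the quantity that was
regressed against session number.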

To  construct  binary  saccade  maps  we  first  assigned  saccade  end 
points  to  Gabor  patches  if  the  distance  from  the  end  point  to  the 
center  of  the  Gabor  patch  was  smaller  than  3.5°.  These 
assignments  allowed  us  to  fill  a  4x8  grid  of  cells  corresponding 
to  the  4x8  grid  of  Gabor  patches,  with  1  for  a  saccade  end  point 
landing  on  the  Gabor  patch  and  a  0  for  no  saccade  towards  the 
patch.  In  this  manner  binary  saccade  maps  were  constructed  and 
later  correlated  with  the  feature  similarity  maps.  When  a  particular 
patch  was  fixated  several  times  we  still  placed  a  one  in  the  map  in 
order  to  retain  the  binary  nature  of  the  saccade  maps. 
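
The construction of a binary saccade map can be sketched as follows. The
3.5° criterion is taken from the text; assigning each end point to its
nearest patch center before applying that criterion, the row-major
ordering of patch centers, and the function name `binary_saccade_map`
are assumptions made for the example.

```python
import math


def binary_saccade_map(endpoints_deg, patch_centers_deg,
                       radius_deg=3.5, rows=4, cols=8):
    """Return a rows x cols grid of 0/1 flags for one trial.

    endpoints_deg     -- list of (x, y) saccade end points in degrees
    patch_centers_deg -- list of rows*cols (x, y) patch centers, in
                         row-major order (assumed layout)
    A cell is set to 1 if at least one end point lands within
    radius_deg of that patch; repeated visits still yield 1.
    """
    grid = [[0] * cols for _ in range(rows)]
    for point in endpoints_deg:
        # Nearest patch center to this end point (assumed tie-break).
        nearest = min(range(len(patch_centers_deg)),
                      key=lambda k: math.dist(point, patch_centers_deg[k]))
        if math.dist(point, patch_centers_deg[nearest]) < radius_deg:
            grid[nearest // cols][nearest % cols] = 1
    return grid
```

End points that fall farther than the criterion from every patch, for
example on the background noise between items, leave the map unchanged.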

Results 

Performance 

Measuring performance as the percentage of correct trials for each
100-trial session, we found that subjects showed improved performance
over the course of the trials (figure 2). The mean percentage
performance of the group was computed by taking an average of the
percentage correct responses by each of the five subjects for each
session. A one-way ANOVA showed an effect of session on mean performance
(F(9,40) = 6.88, p < 0.01). The change in performance, measured by the
slope (indicative of learning rate) of the logistic fit to the data,
halves by day five, and performance later levels off, hovering around
70% to 80% correct as shown in figure 2.

This indicates that the subjects improved on the task and answered
correctly a greater percentage of the time after completing several
hundred trials of the task, despite the fact that the features and
spatial location of the target were changed on every trial. Pooling
together the reaction times for each subject and averaging across the
sessions revealed an effect of session on the mean reaction time (figure
3a) for our pool of subjects (one-way ANOVA F(9,4990) = 50.71,
p < 0.01). A similar but weaker effect in number of saccades was
observed (one-way ANOVA F(9,4766) = 12.62, p < 0.05) as shown in figure
3c. To ensure that the performance improvements observed were not due to
a speed-accuracy tradeoff, we normalized performance by the mean number
of saccades and mean reaction time separately. Mean performance
normalized by the mean number of saccades gave us a measure of subjects’
per-saccade search efficiency. Plotting this as a function of sessions
(figure 3d), we find an increased per-saccade efficiency (one-way ANOVA
F(9,40) = 2.43, p < 0.05). Similarly, plotting mean performance per
session normalized by the mean reaction times (figure 3b), we find an
upward trend of search performance per unit time spent searching
(one-way ANOVA F(9,40) = 3.71, p < 0.01). These results show a clear
improvement of all subjects on the task with training. To confirm that
learning was not just a result of improvement in reporting the numbers
in the brief display, we examined the accuracy of reporting the number
at the position last fixated. We found that the number at the position
of last fixation matched the reported number 82.6% of the time on
incorrect trials and 92.8% on correct trials. Further pooling the trials
together and computing an average over each session, normalized by the
number of incorrect trials, we find no effect of session on report
accuracy (one-way ANOVA F(9,40) = 0.77, p = 0.65). Thus, we can rule out
that performance improvements might have been due to an improved ability
to read and report the numbers.

Figure 2. Performance results. Mean percentage correct performance
obtained by taking a mean across subjects for each of the 10 sessions.
Error bars are SEM across subjects. Smooth curve is a fit to a logistic
function (r² = 0.62, p < 0.05).
doi:10.1371/journal.pone.0009127.g002

Differences  in  Basic  Eye  Movement  Statistics 

The  eye  movements  of  all  the  subjects  were  grouped  by  session, 
and  statistics  were  then  computed  on  this  data.  We  first  analyzed 
the  main  sequence,  which  plots  peak  velocity  against  saccadic 
amplitude.  The  main  sequences  for  session  one  and  session  five  are 
shown  in  figure  4a.  To  determine  whether  there  was  a  difference 
between  the  two  sequences  we  first  fitted  a  linear  function  to  the 
main  sequence  of  session  one  and  then  used  this  model  to  predict 
saccade  amplitudes  using  the  peak  velocity  data  from  session  five 
saccades.  We  then  ran  a  two-sample  t-test  between  predicted 
saccade  amplitudes  and  real  saccade  amplitudes  for  session  five 
and  found  no  significant  difference  (p  =  0.50).  The  analysis  of  the 
main  sequences  therefore  revealed  no  effect  of  training  on  these 
saccade  statistics,  and  the  subjects’  eye  movements  were  similar  in 
this  regard.  Similarly,  no  significant  trend  was  found  in  saccadic 
amplitude  or  velocity  individually  (data  not  shown).  However, 
when we analyzed the ISI we found a significant drop from early
sessions in training to late sessions, as illustrated in figure 4b.
Specifically, a one-way ANOVA showed a strong effect
(F(9,73481) = 43.95, p < 0.05) of session on intersaccadic interval.

Figure 3. Reaction time and saccade count data. (a) Reaction time plotted as a function of session, computed by pooling together all trials by all
subjects for each session and taking the mean. Error bars are SEM. (b) Reaction time Normalized Performance (RNP) score computed by normalizing
mean performance by mean reaction time per session. Error bars are SEM taken across sessions. (c) Saccade counts plotted as a function of session,
computed by pooling together data from all subjects per session and taking a mean. Error bars are SEM. (d) Saccade count Normalized Performance
(SNP) score computed by normalizing mean performance by mean saccade count per session. Error bars are SEM.
doi:10.1371/journal.pone.0009127.g003


These results demonstrate a change in saccadic strategy on the part of
the observers, a change marked by increased efficiency in examining the
Gabor patches and greater speed in rejecting non-target Gabor patches.
As expected, a fall in ISI resulted in a drop in reaction time (RT).
However, we found that RT was more strongly dependent on the number of
saccades made than on ISI. We found a significant dependence (r² = 0.69,
p < 0.05) of RT on the number of saccades made (figure 4c). A weaker
dependence (figure 4d) of RT on ISI was found (r² = 0.18, p < 0.05). The
data shown in the figures are for trials where reaction time was <10 s;
the results for the full dataset were similar (RT vs saccade count
r² = 0.57, p < 0.05 and RT vs ISI r² = 0.22, p < 0.05). Therefore the
number of saccades appeared to be more important in determining RT than
ISI.

Individual  Feature  Similarity  Map  and  Saccade  Map 
Correlations 

Having constructed feature similarity maps and binary saccade maps, we
computed a correlation value between the binary saccade map and each of
the feature similarity maps for each trial. Correlation values for each
session were computed by pooling together trials of all subjects within
a session and then computing the mean. Figure 5 shows that i) feature
similarity maps and binary saccade maps are correlated, and ii) hue and
frequency similarity maps become increasingly correlated with the
saccade maps as the sessions progress, whereas no such trend can be
observed for orientation. The positive trend indicates correlations
between non-zero values in the binary saccade map and high values in the
feature similarity maps. This demonstrates a higher likelihood of
subjects making saccades towards items that are similar to the target.

The significant increase in correlation of the hue map from session one
to session five (paired t-test p < 0.05) demonstrates that subjects
increasingly looked at items that were closer in hue to the target.
There was also a significant increase in frequency correlation from
session one to session five (paired t-test p < 0.01), once again
demonstrating a tendency to saccade towards items with frequency more
similar to the target. This was not the case for orientation, where we
found a non-significant (p = 0.36) difference between session one and
session five.

Figure 4. Saccade statistics. (a) Main sequence, plotting saccade amplitudes against peak velocity for the first session (red) and fifth session (blue).
Overlap shows no difference in main sequence. (b) Intersaccadic interval reduces with session. Data points were computed by pooling saccades for
each session for all subjects and taking a mean. Error bars are SEM. (c) Reaction time as a function of number of saccades. Regression line shows
significant correlation (r² = 0.58, p < 0.05). (d) Reaction time as a function of intersaccadic interval. Regression shows weak correlation
(r² = 0.22, p < 0.05).

We further quantified this result by running a multiple logistic
regression on the data, examining the combined effect of feature
distances on the probability of making a saccade towards the target in a
given session. Coefficients obtained from this regression were then
plotted as a function of session and fitted to a logistic function
$y = \frac{L}{1 + c\,e^{-ax}}$ (figures 6a, b, and c), where L is the
upper limit of the curve, a determines the slope of the curve, and c
determines the shift of the inflection point of the function. L is
evaluated by computing an average of the coefficient values for sessions
seven through ten. The coefficients’ trends plateau at session seven,
coinciding with a plateau in performance; thus we use the mean over
these sessions to compute L. We then linearized the function to run a
linear regression that provided a method for computing the parameters c
and a. The regressions yielded significant trends for hue (r² = 0.50,
p < 0.05) and frequency (r² = 0.49, p < 0.05) coefficients but not for
orientation (r² = 0.18, p = 0.2216).
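
The linearization step can be made explicit. Rearranging
y = L / (1 + c e^(-ax)) gives ln(L/y - 1) = ln(c) - a x, so an ordinary
least-squares line fitted to ln(L/y - 1) against x has slope -a and
intercept ln(c). The Python sketch below illustrates this under stated
assumptions: the function name `fit_logistic_linearized` is hypothetical,
and points with y outside (0, L), for which the logarithm is undefined,
are simply skipped, which may differ from how the original analysis
handled them.

```python
import math


def fit_logistic_linearized(xs, ys, L):
    """Estimate a and c in y = L / (1 + c * exp(-a * x)) for fixed L.

    Uses the linearized form ln(L / y - 1) = ln(c) - a * x and an
    ordinary least-squares line. Points with y <= 0 or y >= L are
    skipped (an assumption) because the transform is undefined there.
    """
    pts = [(x, math.log(L / y - 1.0))
           for x, y in zip(xs, ys) if 0.0 < y < L]
    if len(pts) < 2:
        raise ValueError("need at least two points with 0 < y < L")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_z = sum(z for _, z in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in pts)
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return -slope, math.exp(intercept)  # (a, c)
```

Because L is estimated as a mean over the plateau sessions, some plateau
values will necessarily lie at or above L; how such points are treated
affects the fit and is a design choice left open here.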

These  results  demonstrate  a  tendency  of  subjects  to  exploit  hue 
and  frequency  as  the  primary  features  while  giving  lowest  priority 
to  orientation.  This  effect  has  also  been  observed  in  previous 
studies [31-33] that found a hierarchy of feature efficacy in biasing
saccades  towards  targets,  with  color  being  the  dominant  feature 
followed  by  size  and  orientation. 

Feature  Combination  Rules 

We also investigated the question of what combinations of features might
be learned. Several feature combination rules were tested by combining
the similarity maps using different computations. Figure 7 plots the
correlation values across the sessions for maps constructed using
various methods of combining the individual feature maps. A linear
combination rule for individual features is most widely used [34,35],
where individual features are combined through a linear operation to
form a final saliency map that guides attention. Top-down attention has
been hypothesized to modulate the contribution from each map in an
optimal manner [36] by adjusting biasing weights [37,38]. Correlation
between binary eye movement maps and feature similarity maps constructed
by linearly combining the hue, frequency, and orientation similarity
maps (appropriately weighted) should therefore be high.

Figure 5. Single feature correlations. Feature similarity maps are shown on the left with hot colors showing high similarity. These similarity maps
are correlated with saccade maps to yield a correlation value $\rho_{xy}$. The plot shows mean correlations per session for each feature. Error bars are SEM.
doi:10.1371/journal.pone.0009127.g005

We constructed similarity maps by linearly summing the individual
feature similarity maps for all combinations of the three features, and
found that the map formed from a linear combination of the hue and
frequency maps (H+F) was most strongly correlated with eye movements.

Figure 6. Multiple logistic regression results. (a) Coefficient values for each feature plotted as a function of session. (b) Regression line fitted to
the coefficient values for hue (r² = 0.50, p < 0.05), (c) frequency (r² = 0.49, p < 0.05), and (d) orientation (r² = 0.18, p = 0.2216).

Figure 7. Feature combination correlations. Plots showing
correlations of feature similarity maps combined using various
methods, as a function of sessions. The black curve (Max map)
represents an upper bound computed by taking the most correlated
feature map on each trial and computing averages across all trials for
each session. The correlation values for this upper bound can be used
to compare mean correlation values for all other combination rules: H*F
(red), H*F*O (green), H+F (blue), H+F+O (cyan), and point-wise
minimum rule (magenta).
doi:10.1371/journal.pone.0009127.g007

To obtain an upper bound of correlation against which each
rule in Figure 7 could be compared, we created a maximum map
(labeled “MaxMap” in the figure). The correlation values for this
map were computed by taking the feature similarity map on each
trial that had the strongest correlation with the saccade maps and
storing this correlation value. The mean across trials was then
computed from this trial-wise maximum, thus yielding an upper
bound. We found that the map formed from the linear
combination of hue and frequency (H+F map) was the closest to
the upper bound. A significant effect of session on correlation
values for this map was also observed (one-way ANOVA,
F(9, 4666) = 6.61, p < 0.05). This suggests that subjects attended to
the hue and frequency features and improved on the task by
appropriately tuning top-down signals in the hue and frequency
dimensions.
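
As an illustration only, the trial-wise maximum described above can be sketched in Python with NumPy. The function names (`pearson`, `max_map_upper_bound`, `mean_rule_correlation`), the dictionary layout of the feature maps, and the choice of Pearson correlation are assumptions made for this sketch; they are not taken from the analysis code of the study.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two maps, treated as flat vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A constant map has no defined correlation; report 0 in that case.
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def max_map_upper_bound(trials):
    """Mean over trials of the best single-feature correlation.

    trials: list of (feature_maps, saccade_map) pairs, where feature_maps
    is a dict such as {"hue": ..., "freq": ..., "ori": ...} (hypothetical keys).
    """
    best = [max(pearson(m, sacc) for m in maps.values())
            for maps, sacc in trials]
    return float(np.mean(best))

def mean_rule_correlation(trials, rule):
    """Mean over trials of the correlation for one combination rule."""
    return float(np.mean([pearson(rule(maps), sacc) for maps, sacc in trials]))
```

By construction, the trial-wise maximum is at least as large as the mean correlation of any single feature map. Whether it also bounds a combined map such as H+F is an empirical matter that the comparison in Figure 7 addresses.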

We also explored a multiplicative combination rule whereby we
combined the maps in a point-wise multiplicative manner. Thus, if
a feature at a particular location is poorly matched to the target’s
feature, it will eliminate the chance for all other features to select
this location as a potential target. This predicts a sparse saliency
map, and has the elements of an AND operation on the multiple
feature maps. However, the correlation values for the
multiplicative map H*F*O are not as strong as those for
the H+F map. Despite the weak correlation, we do find a trend in
the correlation values for the H*F map (one-way ANOVA,
F(9, 4666) = 5.61, p < 0.05). These results demonstrate a general
improvement in the subjects’ tuning to the features of the target
upon preview, and also suggest that while the multiplicative rule
makes for a computationally useful guidance strategy, a linear rule
may be a more biologically plausible operation.
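
The difference between the linear and multiplicative rules can be made concrete with a minimal Python sketch. The helper names (`combine_linear`, `combine_product`), the equal default weights, and the toy similarity values are assumptions chosen for illustration, not values from the experiment.

```python
import numpy as np

def combine_linear(maps, weights=None):
    """Weighted point-wise sum of feature similarity maps (e.g. H+F)."""
    maps = [np.asarray(m, dtype=float) for m in maps]
    if weights is None:
        weights = [1.0] * len(maps)  # equal weights: an assumption
    return sum(w * m for w, m in zip(weights, maps))

def combine_product(maps):
    """Point-wise product of feature similarity maps (e.g. H*F)."""
    out = np.ones_like(np.asarray(maps[0], dtype=float))
    for m in maps:
        out = out * np.asarray(m, dtype=float)
    return out

# Three locations: the target, a hue-only match, a frequency-only match.
hue = np.array([1.0, 0.9, 0.0])
freq = np.array([1.0, 0.0, 0.9])

linear = combine_linear([hue, freq])    # [2.0, 0.9, 0.9]
product = combine_product([hue, freq])  # [1.0, 0.0, 0.0]
```

In this toy case a single mismatched feature zeroes a location under the product rule (the AND-like behavior), whereas the linear rule leaves partially matching locations with intermediate values.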

We then constructed a point-wise minimum map, which would
have the highest signal-to-noise ratio. The map was constructed by
placing in each cell the value of the least similar item. In this
manner the map contains low values in all locations except at the
target cell location, where the three feature maps would contain
equal values. This strategy would call on a hypothetical observer to
adopt the counter-intuitive strategy of searching for features that
are most dissimilar to the target, thus highlighting a single location
(target location) where no dissimilarities are found. However, it is
difficult to conceive of a neural strategy that would enable such a
mechanism, since it would require pre-computation of all three
feature maps, extraction of the most discriminative feature for
each item, followed by construction of the final guidance map.
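
A point-wise minimum rule of this kind can be sketched as follows. The function name `combine_min` and the toy similarity values are hypothetical and serve only to illustrate the operation.

```python
import numpy as np

def combine_min(maps):
    """Point-wise minimum across feature similarity maps."""
    return np.minimum.reduce([np.asarray(m, dtype=float) for m in maps])

# Location 0 is the target (all features match); the others are distractors.
hue = np.array([1.0, 0.9, 0.2])
freq = np.array([1.0, 0.1, 0.8])
ori = np.array([1.0, 0.7, 0.9])

min_map = combine_min([hue, freq, ori])  # [1.0, 0.1, 0.2]
target_location = int(np.argmax(min_map))  # 0
```

Each distractor is represented by its worst-matching feature, so only the target location keeps a high value.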

Discussion 

The triple conjunction search task learned by subjects in this
study consisted of displays that remained consistent in the number
of items and bottom-up uncertainty; however, the target changed
both its location and features on each trial. Learning still took
place under these conditions, and the combined behavioral,
oculomotor, and perceptual signatures of the improvement point
towards effects beyond task acquisition. Behaviorally, we saw an
improvement in performance, with subjects reporting the correct
target on average 44% of the time at the beginning of the task and
on average 71% of the time after developing expertise in this feature-rich
environment. The oculomotor correlate of learning was evident
from the changes in saccadic behavior, namely in the shorter ISI
with training. Differences in basic saccade statistics in conjunction
with visual search as well as learning have not been studied
extensively. Phillips et al. [39] argue that gains in visual search
performance are a result of an expansion in the ‘perceptual span’
and forward saccade amplitude, with a small effect of fixation
duration, which is equivalent to the ISI in our case. The
improvement obtained in our case suggests both that there was
an increase in perceptual span and that dwell time for
extracting information from each fixation was reduced.

Hooge & Erkelens [40] conducted experiments to specify the
role of fixation duration in visual search tasks. The most salient
feature of their study was the reconciliation of the contradictory
findings of [41], who found significant guidance of saccades
towards items that were similar in color to the target, and Zelinsky
[42], who did not find such guidance. Hooge & Erkelens [40]
provide a means to make a leap from oculomotor dynamics to
visual search performance using fixation duration as the vehicle for
understanding the difference. They suggest that tasks involving
difficult discriminations but easy peripheral selections tend to
invoke longer fixation durations, while tasks involving easy
discrimination but difficult peripheral selection (due to either an
abundance or similarity of distractors around a target) tend to have
shorter fixation durations but evoke a greater number of saccades.
Our task is a difficult conjunction search where distractors share
features with the target; this makes it a ‘hard-discrimination,
hard-selection’ task. Therefore, initially we obtain high ISIs (in fact, ISI
goes up from session one to session two), which perhaps suggests
that our subjects’ oculomotor strategy focused on the foveal
discrimination early in the task. High saccade count and reaction
times suggest that the selection task was not easy either. However,
with training we obtain much lower ISIs, which implies that
subjects improved on the discrimination task and could now
concentrate resources on the selection task. Further, we find that
the mean number of saccades stays fairly constant, with subjects
scanning over half the number of items on average. Thus, there is
no significant change in the number of selections made during the
search process; however, the ‘quality’ of the selections improves,
i.e. the distractors chosen as potential targets are closer in their
features to the target. The quicker ISIs may point toward an
increased ‘perceptual span’ [43] or ‘visual lobe’ [44] that enables
examination of a greater number of items in each saccade;
however, additional experiments would be required to confirm this
claim.

The oculomotor correlate of learning (i.e. improved
discrimination by moving from discriminative search to selective
search) then makes the prediction that subjects would have a
higher tendency to make saccades towards patches that are
similar to the target as they transition from discriminative search
to selective search. Indeed, this is what we found when we
correlated saccade maps with feature similarity maps. By
running a multiple logistic regression we found that whether a
patch was selected for fixation could be predicted by the
similarity of its features to the target and the level of training of the
subjects. These results on the similarity effect [45] serve as
corroboration of several previous studies, including [31], who
found that monkeys make fixations to items that are similar in
color but not orientation. Findlay & Gilchrist [45] also found a
proximity effect, i.e., a tendency of saccades to fall near the
target in space. Motter & Belky [31] also investigated this
selection for color as a guiding feature over orientation. They
conclude from their 1998 study, as well as electrophysiological
studies in V4 [46,47], that V4 neurons coded more strongly for
stimuli in their receptive field that matched the top-down goal
rather than the absolute color of the stimuli. This suggests that a
color feature map would be the tool of choice for top-down
attention in the guidance of saccades. Our study also
demonstrates a preference for spatial frequency over orientation.
Several other studies [32,33] have found a similar
preference for color as a guiding feature, and Wolfe & Horowitz
[48] have placed color on top of the list of features that guide
attention. We hypothesize that spatial frequency could be
considered a ‘surface property’, much like texture and color, that
has desirable qualities for the guidance of attention. However,
the current experiment does not address this feature-selective
guidance, and it would require further experiments to verify why
orientation is a weaker cue for top-down attention in the
presence of other features.
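
To make the regression idea concrete, the following is a minimal Python sketch of a multiple logistic regression that predicts fixation of a patch from its three feature similarities. Everything here is assumed for illustration: the function names (`fit_logistic`, `simulate_fixations`), the plain gradient-ascent fit, and the simulated weights, which are arbitrary and are not estimates from the study.

```python
import numpy as np

def fit_logistic(X, y, lr=1.0, n_iter=5000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood.

    X: (n_patches, n_features) similarities; y: 1 if the patch was fixated.
    Returns (coefficients, intercept).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))   # predicted fixation probability
        w += lr * Xb.T @ (y - p) / len(y)   # gradient step
    return w[:-1], w[-1]

def simulate_fixations(n=2000, seed=0):
    """Toy data: columns are hue, frequency, orientation similarity in [0, 1]."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, 3))
    true_w = np.array([3.0, 2.0, 0.0])  # arbitrary weights, not from the study
    true_b = -2.0
    p = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))
    y = (rng.random(n) < p).astype(float)
    return X, y

X, y = simulate_fixations()
coef, intercept = fit_logistic(X, y)
```

In this toy setting a larger fitted coefficient means that similarity in that feature raises the odds of a patch being fixated more strongly, which is the sense in which the coefficients in Figure 6 are read.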

In this study the top-down goal changed on each trial, and
despite this we saw an increased similarity effect, which suggests
that activity of neurons in the visual cortex (e.g. V4 neurons) can
be biased in a highly dynamic and rapid manner from one trial to
the next. Therefore, departing from typical perceptual learning
studies, we show evidence for learning that involves top-down
processes. Herzog & Fahle [49] put forward a recurrent neural
network model of perceptual learning that emphasizes the role of
plasticity in the top-down connections as an enabling process for
perceptual learning. They show that even in a task like vernier
discrimination, where learning is specific both to stimulus features
and to spatial location, a model that incorporates top-down
influences has more explanatory power than pure bottom-up
models of improvement. Specifically, they show that in a model
where top-down connections gate the flow of bottom-up inputs to
decision units, learning acts upon the weights of the top-down
connections rather than the tuning properties of the bottom-up
(sensory) inputs. The current study can also be placed in this
context, situating the locus of plasticity in the top-down process
rather than the bottom-up sensory process. However, in addition
to this, the increase in the similarity effect that we find suggests
that the ability to quickly switch the top-down signal also
improved. It is certainly the case that there is a task-based effect,
and we cannot ascertain the exact contribution that
improvement in top-down biasing alone made toward progress
in the task. However, it is clear from our analysis of correlation
between feature similarity maps and binary saccade maps that
there is enhanced guidance through better top-down biasing. We
find that training enhances the similarity effect, and a possible
mechanism for this is improved top-down biasing. This enhances
the right neurons, which in turn guides attention to patches that are
increasingly similar to the target.

Conjunction searches define targets using a combination of
features, and binding of these features according to feature
integration theory [34] requires attention. We examined the
correlations of binary saccade maps and different combinations of
feature similarity maps and found that a linear combination of the
features hue and frequency was most highly correlated with
saccade maps. We tried a multiplicative rule which provides the
sparsest final similarity since it penalizes differences in a single
feature while greatly boosting locations with a single matched
feature. A similarity map constructed from a multiplication of hue
and frequency was closely matched in terms of correlation with eye
movements to the linear H+F map; however, the H*F*O map was
poorly correlated with eye movements. A multiplicative rule,
however, does not account for the serial search times for
conjunction searches, since a precomputation of this multiplicative
combination of features would put a hot-spot in a salience map at
the location where all features match the target with high SNR.
Overall, this exploration points towards a linear combination rule
that may be at play. That said, our discussion of the similarity
effect also suggests a pre-attentive guidance of saccades towards
potential targets. And if guidance is pre-attentive and feature
combination requires attention, the prediction would be that
conducting a conjunctive search is a serial process with respect to
spatial attention and feature-based attention, and thus inefficient.

Attention exhibits characteristic neural signatures in
brain regions that process sensory signals. An important
area of future research is to understand the nature of
top-down signals that facilitate attentional guidance towards
behaviorally relevant locations and features. In
this review, we discuss recent studies that have made
progress towards understanding: (i) the brain structures
and circuits involved in attentional allocation; (ii) top-down
attention pathways, particularly as elucidated by
microstimulation and lesion studies; (iii) top-down modulatory
influences involving subcortical structures and
reward systems; (iv) plausible substrates and embodiments
of top-down signals; and (v) information processing
and theoretical constraints that might be helpful in
guiding future experiments. Understanding top-down
attention is crucial for elucidating the mechanisms by
which we can filter sensory information to pay attention
to the most behaviorally relevant events.

Introduction 

Language  is  infused  with  idiomatic  expressions  that  make 
explicit  the  distinction  between  bottom-up  (BU)  and  top- 
down  (TD)  processes  of  attention.  We  might  ask  someone  to 
‘pay  attention  to  the  road’  while  driving,  which  implies  a 
voluntary  choice  to  allocate  resources  to  a  subset  of  the 
perceptual  input.  Alternatively,  we  might  remark  that  the 
orange  sports  car  really  ‘caught  our  attention’.  In  this  case, 
the  resource  has  been  involuntarily  captured  rather  than 
voluntarily  allocated.  The  distinction  is  not  limited  to 
idiomatic  expressions,  but  rather  stems  from  disparate 
modes  of  attentional  processing  [1].  BU  attention  is 
deployed  very  rapidly  and  depends  exclusively  on  the 
properties  of  a  sensory  stimulus.  By  contrast,  TD  attention 
is  slower  and  requires  more  effort  to  engage. 

In the modality of vision, the two modes (BU and TD)
give rise to the psychophysical phenomena of pop-out and
set-size effects. In a typical visual search experiment, a
subject is presented with a number of items on a display
and is asked to find a target item within this display, such
as a bar with a particular orientation, or color, or a combination
of the two. Pop-out occurs when the target item is
significantly distinct from the surrounding items (distractors),
such as a horizontal bar among several vertical bars.
This different item automatically attracts BU attention (or
pops out) rapidly and independently of the number of
distractors [2,3]. By contrast, when the target item is
distinguished only by taking into account the conjunction
of its features, such as color and orientation, BU cues alone
cannot efficiently guide attention and TD attention must
be recruited to scan the display. This gives rise to search
times that increase with the number of distractors; in other
words, a set-size effect is observed. In most real-life situations,
the responses of the nervous system to a sensory
input depend on both BU influences driven by the sensory
stimulus and TD influences shaped by extra-retinal factors
such as the current state and goal of the organism [4,5].

A distinction is also made between two types of TD
mechanisms. The first type is intuitively associated with
TD and is called the volitional TD process, which can exert
its influence through acts of will. The second type is known
as a mandatory TD process and it is an automatic,
percept-modifying TD mechanism that is pervasive and that
volition cannot completely eliminate. The latter TD process
can develop through experience-dependent plasticity or
during development, and includes contextual modulation

Glossary 

BU  influence:  influence  on  the  nervous  system  due  to  extrinsic  properties  of 
the  stimuli. 

Conjunction  search:  search  task  in  which  a  subject  is  required  to  find  a  target 
item  among  several  distractors,  and  the  target  is  defined  by  a  unique 
conjunction  of  features.  In  this  type  of  search  task,  locating  the  target  is  more 
difficult  because  distractors  share  some  of  the  features  of  the  target  and  thus 
the  target  does  not  obviously  stand  or  pop  out. 

Covert  attention:  attention  paid  to  a  subset  of  the  sensory  inputs  through 
mental  focusing. 

Feed-forward  sweep:  first  epoch  of  neural  activity  that  travels  from  lower  to 
higher  visual  areas  on  the  onset  of  a  visual  stimulus  via  feed-forward 
connections. 

Mandatory  TD  process:  attentional  process  that  influences  sensory  processing 
in  an  automatic  and  persistent  manner. 

Overt  attention:  attention  paid  through  orienting  of  sensory  organs  toward  a 
sensory  input  of  interest. 

Percept:  mental  impression  of  an  external  stimulus. 

Pop-out  search:  search  task  in  which  a  subject  is  required  to  find  a  target  item 
among  several  distractors,  and  the  target  is  defined  by  a  unique  visual  feature 
not  shared  with  any  of  the  distractors.  The  target  thus  stands  or  pops  out  and  is 
easy  to  find. 

Priority  map:  map  of  visual  space  constructed  from  a  combination  of 
properties  of  the  external  stimuli,  and  intrinsic  expectations,  knowledge  and 
current  behavioral  goals. 

Recurrent  epoch:  second  epoch  of  neural  activity  that  occurs  after  an  initial 
response  to  onset  of  a  stimulus  and  is  mediated  by  intra-cortical  horizontal 
connections  and  inter-cortical  feedback  connections. 

Saliency  map:  map  of  stimulus  conspicuity  over  visual  space. 

Set-size  effect:  in  search  tasks,  a  set-size  effect  is  observed  if  the  time  required 
to  find  the  target  depends  on  the  total  number  of  items  in  the  display  (the  set 
size). 

Task-relevance map: map of behaviorally relevant locations over visual space.

TD influence: influence on the nervous system due to extra-retinal effects such
as intrinsic expectations, knowledge and goals.

Volitional TD process: attentional process that exerts influence on sensory
processing through an act of volition, such as willfully shifting attention to the
right part of space.

Figure 1. Mandatory versus volitional TD processes. (a) Rubin's vase illusion is an
example of a volitional TD process. The percept can be switched from face to vase
and vice versa through an act of will, demonstrating TD modulation of perceptual
processing in a volitional and dynamic manner. (b) Four frames from a
demonstration of a rotating mask that seems to be convex, even in places where
it is in fact concave. At the start of the rotation (top two frames), the mask is convex
and, as it rotates, the viewer begins to see the inside of the mask (bottom two
frames), but this still seems to be convex. This demonstrates an inherent bias in
perceiving faces as convex rather than concave, even when this contradicts BU
sensory information, and thus provides an example of mandatory TD processing.
Reproduced with permission from [134].

of  perception  [5,6].  A  striking  example  of  the  dichotomy 
between  these  two  mechanisms  is  presented  in  Figure  1. 

Previous work has extensively studied the effects of TD
attention on target brain regions, including modulatory
effects in early sensory areas [5,7,8]. Significant progress
has been made in isolating the possible sources of TD
signals [9], especially within the now well-studied
frontoparietal attention network [10]. Much less understood at
present are the exact pathways, contents, meaning and
form of the signals that are sent from the top down. Here,
we review recent findings from physiology, lesion and
computational studies that have attempted to elucidate
the mechanisms and signals involved in TD modulation of
sensory processing. To focus this review, we mainly concern
ourselves with visual perception and the volitional TD
process, although similar principles can apply in other
modalities.

Brain  structures  and  circuits  of  visual  attention 

Visual processing begins in the retina, which sends parallel
streams of information to the brain through its diverse set
of retinal ganglion cells and their unique interactions
within the retinal circuitry [11]. A majority of retinal
projections reach the lateral geniculate nucleus (LGN)
and a much smaller number (approx. 10%) connect to
the superior colliculus (SC). The LGN sends projections
to the primary visual cortex (V1), the initial site of processing
in the cortical feed-forward visual pathway. This
pathway has been functionally divided into the dorsal
and ventral streams [12]. The dorsal stream has been
described as the ‘where’ pathway and leads from area
V1 to motion processing areas [middle temporal (MT)
and medial superior temporal (MST)] and parietal cortices.
The ventral or ‘what’ pathway comprises striate (V1) and
extrastriate areas (V2, V3, V4) and leads to the inferotemporal
cortex (IT), believed to be the last feature-selective
area in the visual processing hierarchy.

Modulatory effects of attention have been observed in
the constituent structures of both the dorsal and ventral
streams. The first structure subject to strong attention
effects is the SC. The SC is a layered midbrain structure
that receives direct input from the retina, as well as
feedback inputs from area V1. Salient visual events are
represented in the superficial layers of the SC [13,14] and
can further combine, in the deeper layers, with TD information
to give rise to a priority map that guides attention
[14]. This attention map is probably shared or jointly
computed with the lateral intraparietal (LIP) region of
the cortex [15], the frontal eye fields (FEF) [16] and visual
cortices, through direct afferent connections from the cortex
to the SC, as well as indirect efferent connections from
the SC to the cortex via the pulvinar [17]. These connections
are important for communicating attention-related
signals to higher cortical areas while bypassing the canonical
ventral pathway.

Situated a level above the SC in the visual processing
hierarchy are the thalamic nuclei, which are involved in
processing many types of sensory information and are
susceptible to modulation by attention. The LGN is the
most visually responsive of the thalamic nuclei, and both
physiological studies in monkeys and imaging studies in
humans have shown that attention can modulate signals in
the LGN [18,19]. The modulation includes enhancement of
neural responses to attended stimuli and suppression of
responses to unattended stimuli [19]. Thus, visual sensory information
is already subject to attentional modulation even before
entering the cortex.

The first cortical stage of visual processing, area V1, is
the first major feature-sensitive area of processing and is
also modulated by attention. However, these effects are
relatively weak [20,21]. Moving up the visual processing
hierarchy from V1, V2, V4 to IT, receptive field sizes
increase and visual areas are progressively more sensitive
to features than spatial locations of stimuli. When attention
is allocated to a certain part of visual space, neurons
encoding this part are facilitated (a phenomenon known as
spatial attention). The allocation of attention to a particular
non-spatial feature, such as the color or orientation of
an object, facilitates neurons encoding the attended feature
(feature-based attention). Along the ventral pathway,
extrastriate areas V4 and IT have large receptive fields and
effects of feature-based attentional modulation are more
evident. Motion-sensitive MT and MST areas are also
modulated by both spatial and feature-based attention
[22]. This tendency for combined modulation of sensory
signals by both spatial and feature-based attention
increases as the signals progress from lower to higher
cortical areas such as the LIP.

The LIP area has been studied extensively and several
excellent recent reviews have described its diverse roles in
attention, reward, and oculomotor behavior [15,23,24]. It
is important to point out that responses in area LIP can be
driven by both BU factors, such as stimulus salience, and
TD factors, such as behavioral relevance of stimuli [25],
the current locus of attention [26] and oculomotor planning
[15]. Therefore, the LIP is another candidate structure
(beyond the SC described above) where BU and TD
influences can combine to give rise to a spatial priority map
[15]. The many facets of observed responses in the LIP can
be attributed to the fact that both BU and a diverse set of
TD influences can give rise to behavioral priority, and thus
modulate LIP responses, which suggests that the LIP
encodes priority in a manner largely agnostic to the factors
that caused the priority [15]. Through direct feedback
connections [27] or connections via the pulvinar to visual
areas (see below), the LIP can communicate the fused
signals to other brain areas for biasing or further attentional
processing.

FEF neurons also represent salient stimuli, specifically
stimuli that vary significantly from surrounding items in a
visual display (known as odd-ball stimuli). The FEF has
also been described as a region with neural responses
characteristic of a priority map [16]. Single-unit responses
in monkey FEF exhibit transients on stimulus onset,
followed by a later response (latency of ~100 ms) that
discriminates an odd-ball stimulus from surrounding
distractors [14,16]. This suggests that the FEF computes
salience in the recurrent epoch rather than the initial
feed-forward sweep [28,29]. The FEF’s connections to motor
neurons in intermediate and deep layers of the SC
make it an important structure in oculomotor behaviors
associated with attention. In addition to this role of the
FEF in representing BU salience, we examine in the
following section its involvement in projecting TD signals
to other regions of the attentional network.

Effects of attention have also been observed in prefrontal
cortex (PFC). The PFC is thought to be involved in
short-term memory processes, and recent studies suggest
that the PFC also exhibits strong signals related to attentional
selection [30,31]. Owing to its involvement in
short-term memory and its position high in the visual
hierarchy, it is also the primary candidate for generating
TD signals and sending them to sensory cortex for spatial
or feature-based attentional biasing.

Therefore, the LGN, the striate and extrastriate cortex
(areas V1, V4, IT and MT), as well as the SC, pulvinar, LIP,
FEF and PFC, are known to be involved in attentional
processes. Modulatory attentional signals are found as
early as in the SC (a brainstem structure) and in the
LGN, the first stop along the visual processing hierarchy
[18,19]. These signals act progressively sooner and with
stronger modulatory power going up from area V1 to area
IT [20]. These signals can bias attention for particular
visual locations [32], visual features [33-36], or both.
The characteristic signature of these attentional modulations
onto target sensory areas includes heightened gain,
sharpened tuning and other end-effects, as reviewed previously
[8,37,38]. In the following section, we examine the
areas that are specifically involved in mediating TD attentional
signals.

Figure 2. Flow of attentional signals in brain structures that have been implicated in attentional studies. The flash symbol indicates that a candidate
structure has been microstimulated and an X indicates that the structure has been lesioned in a previous study (see Table 1 for details). The connections show the most
likely type of signal being transmitted between two areas; TD signals are shown in blue, BU signals in red and bidirectional signals in gray. Abbreviations: SC, superior
colliculus; SNr, substantia nigra pars reticulata; MD, mediodorsal thalamus; LGN, lateral geniculate nucleus; IT, inferotemporal cortex; MT, middle temporal area; LIP, lateral
intraparietal area; FEF, frontal eye fields; PFC, prefrontal cortex.

Pathways  of  TD  attention 

In this section, we focus on lesion and electrophysiological
studies, particularly those using methods of microstimulation
and simultaneous recordings in the brain areas identified
in the previous section. These areas form an
attentional network (Figure 2) and we consider how TD
information is relayed in this network. Microstimulation,
together with reversible inactivation [using either pharmacological
agents such as muscimol or transcranial magnetic
stimulation (TMS)] and permanent lesion studies,
have enabled researchers to go from correlation to causation
in the study of perception and attention (Table 1).

It has been suggested that all sensory stimuli compete
for entry into working memory [39]. Working memory not
only stores information, but also enhances this information
and actively generates TD attentional signals that bias
feature-sensitive brain regions, and is thus vital for accomplishing
behavioral goals [39]. An elegant study used a
posterior-split-brain paradigm to demonstrate that the PFC
transmits the contents of working memory to the visual
system [40]. In this study, monkeys were presented
with a visual cue in either the left or right hemifield,
followed by a probe stimulus. The task was to respond to
the appearance of the probe that had previously been
associated with the cued item. BU signals were recorded
by presenting the cue in the hemifield contralateral to the
recording site in the IT (i.e. direct BU path from the retina
up to IT), whereas TD signals could be recorded from area
IT by presenting the cue in the visual hemifield ipsilateral
to the recording site in the IT. The posterior callosum
transection precluded direct communication between visual
cortices from both sides of the brain, so it was hypothesized
that the TD signals were fed back from the PFC to
area IT (Figure 3a). To move to a more causal explanation,
the next experiment involved transection of the anterior
corpus callosum (thereby cutting that hypothetical pathway),
which resulted in a lack of response from the IT cells
[40]. These results demonstrated that TD signals correlating
with working memory emanate from the PFC and feed
back into the ventral stream. A more recent study also used
the posterior-split-brain paradigm in conjunction with unilateral
PFC removal and demonstrated that performance
on a search task was mainly impaired when the goal of the
search was switched on a regular basis [41]. This study thus
highlighted the importance of the PFC in switching the TD
context. It has also been found that microstimulation of the
PFC leads to biases in target selection towards or away
from the stimulation field, which demonstrates how TD
signals can affect oculomotor behavior [42]. Furthermore,
the sheer connectedness of the PFC suggests that its effects
are pervasive and are driven by a combination of goals,
rewards, salience, and planning of motor actions [9,39].


The next area proximal to the PFC, and an important
player in TD attention, is the FEF. Sub-threshold FEF
stimulation enhances responses of V4 neurons in the presence
of a stimulus in their receptive field (Figure 4a) [43].
This demonstrates that descending TD signals from the
FEF bias processing in area V4. These results were replicated
in analogous regions of the barn owl [44]. The comparison
of local field potentials (LFP, which may be
strongly driven by afferent inputs from other brain regions)
and spiking activity in the FEF (which represents intrinsic
activity of FEF neurons) revealed that target-selective
signals appeared in spiking activity before showing a
difference in the LFP, which suggests that spatial selection
was computed locally in the FEF [29]. There is speculation
that this emergence of selection is communicated down to
ventral regions through a synchronization of gamma-band
activity between the FEF and area V4 [45]. However, a
lesion study demonstrated that temporary inactivation of
the FEF (using a GABA-A receptor agonist, muscimol) led
to deficits not only in visually guided saccades, but also in
shifts of attention during either pop-out or conjunction
visual searches [46]. Contrary to an earlier study [29],
these findings suggested that the FEF, although involved
in covert attention, does not locally compute the selection
but is rather a participant in a network with heavy involvement
of the LIP.

Area LIP is strongly connected to the FEF and is integral
to the attentional network through both anatomical
and functional characterization. Suprathreshold microstimulation
in the posterior parietal cortex (PPC), which
includes both area LIP and the ventral intraparietal area
(VIP), induces saccades; however, the current required to
induce saccades is significantly higher than that
required when microstimulating the FEF, which suggests
that the connection from the PPC to the oculomotor
system might not be a direct one. Subthreshold stimulation
results in a shift of covert attention [47]. Interestingly, a
non-spatial effect was also found whereby reaction times in
detecting a target decreased irrespective of whether a
valid, invalid, or no cue was presented [47]. This suggested
that microstimulation of the LIP can override the cue
signal and orient attention to the visual location corresponding
to the site of stimulation. Evidence from lesion
studies demonstrates that damage or inactivation of the
LIP causes deficits only in the presence of multiple stimuli
[48,49]. These results point to an additional role of the LIP
in resolving competition among stimuli represented at
lower levels through TD connections to these levels [7,50].

The aforementioned studies did not, however, differentiate
between the dorsal (LIPd) and ventral (LIPv) subdivisions
of the LIP. In a more recent study, the effects of
local reversible inactivation (using a GABA-A receptor
agonist) in areas LIPd and LIPv have been studied separately
[51]. Interestingly, the many dimensions of LIP
responses demonstrated previously [23] were shown to
reside in disparate subdivisions of the LIP. Inactivation
of the LIPd affected performance on simple saccade tasks
but left visual search intact, whereas temporary lesions of
the LIPv led to deficits in both search and saccadic performance
[51]. The authors stressed that deficits in saccadic
performance after LIPd inactivation were far smaller than
those observed after inactivation of the FEF and SC. They
thus concluded that the LIP might influence or modulate
the motor decision but that the final decision is made by
more downstream structures such as the SC and FEF. This
coincides with the view that attentional selection might
indeed be separate from motor selection [23]. As discussed
previously, the diversity of properties exhibited by LIP
neurons might reflect the fact that it encodes priority
without regard for what caused the priority, BU or TD
influences.

Table 1. Microstimulation and lesion studies of different brain structures involved in attention.

| Brain region | Microstimulation studies: implications for attentional processing | Refs | Lesion studies: implications for attentional processing | Refs |
|---|---|---|---|---|
| SC | Shift of spatial attention | [55] | Deficit in target selection | [58,68] |
| | Perceptual facilitation at site of stimulation | [56] | Deficit in perceptual decision in presence of distractors | [60] |
| | Selection of target independent of motor plan | [57] | | |
| | Signal transmitted to MT via pulvinar | [70] | | |
| LGN | Elicits visual percepts | [111] | Eliminates residual visual responses in extrastriate cortex after V1 lesion | [112] |
| | Disruption of smooth pursuit eye movements | [113] | Deficits in target detection (human) | [110] |
| Pulvinar | - | | No deficit in saccadic behavior | [67] |
| | | | No deficit in visual search | [68] |
| | | | Deficit in suppression of distractors during search (human) | [65] |
| | | | Spatial and temporal attention deficits with anterior and posterior lesions respectively (human) | [66] |
| V1 | Target selection disrupted with upper layer stimulation, facilitated with lower layer stimulation | [114] | Deficit in motion detection and discrimination | [117] |
| | Lower current thresholds needed for evoking saccades in lower layers | [115] | Deficit in saccade targeting | [118] |
| | Median current of 5.2 µA (6.6 µA) required for behavioral detection of stimulation (c) | [116] | | |
| V4 | | | Deficit in distractor suppression when target and distractor are inside RF of neuron | [119] |
| | | | Deficit in distractor suppression | [133] |
| IT/TE | Biases perceptual judgement in visual classification | [53] | No behavioral deficit when lesion is made in infantile monkeys | [120] |
| | Bias in selection of stimulus category | [54] | Deficit in distractor suppression | [119] |
| | Median current of 10.3 µA (11.3 µA) required for behavioral detection of stimulation (c) | [116] | | |
| MT | Bias in motion direction discrimination | [122] | Loss in perception of motion | [124] |
| | Bias in motion direction during stimulus presentation but not during memorizing period | [123] | Loss in perception of motion more evident in noisy conditions | [132] |
| | Median current of 10.1 µA required for behavioral detection of stimulation (c) | [116] | | |
| LIP | Sub- and suprathreshold stimulation lead to covert and overt shifts of attention respectively | [47] | Deficit in distractor suppression even when stimuli are non-overlapping within RF, contrast with [121] | [49] |
| | Bias in visual selection | [125] | Dorsal lesion leads to oculomotor deficits; ventral lesion leads to attentional and oculomotor deficits | [51] |
| | | | Affects performance in tasks requiring spatial attention | [48] |
| FEF | Enhanced response elicited in V4 | [43] | Deficit in target detection | [46] |
| | Facilitation akin to allocation of covert attention | [126] | Enhanced contrast sensitivity in fovea but not periphery (human) | [128] |
| | Bias toward direction of saccade plan rather than location of attention | [127] | Disruption of facilitation by saccade plan to location corresponding with stimulation site (human) | [129] |
| PFC | Bias in target selection | [42] | Loss of TD signal recorded in IT | [40] |
| | Disruption in saccadic activity | [130] | Decrease in behavioral performance when cue is frequently switched | [41] |
| | | | Elimination of acetylcholine release in sensory cortex after stimulus presentation (rat) | [131] |

(a) All studies have been conducted in monkeys unless otherwise denoted.

(b) Abbreviations: RF, receptive field; SC, superior colliculus; LGN, lateral geniculate nucleus; IT, inferotemporal cortex; MT, middle temporal area; LIP, lateral intraparietal area; FEF, frontal
eye fields; PFC, prefrontal cortex.

(c) Stimulation current values reported in two monkeys (see [116] for details).

Figure 3. Role of the PFC in mediating TD attentional signals. (a) Posterior-split-brain paradigm. In this study, monkeys had to associate stimulus A with stimulus B [40].
Stimulus A was then presented as the cue, followed by a probe stimulus, and the task was to release a lever when the probe matched the associated stimulus B. Neurons in
the inferotemporal cortex (IT) were recorded in one hemisphere while the cue was presented either contralateral to the recording site (BU condition, i.e. information about
the cue could reach area IT directly; top-left panel) or ipsilateral (TD condition, i.e. information about the cue could only reach area IT via the anterior corpus callosum;
top-right panel). The bottom panels show neural responses in BU (black trace) and TD (blue trace) conditions. The left-hand plot shows the responses after a posterior split,
demonstrating how cue information could reach area IT in both the BU and TD conditions. The right-hand plot shows complete abolition of the TD signal after a full split of
the corpus callosum (CC). This is one of the clearest demonstrations of the two different types of signal, BU and TD, recorded in a visual area. Reproduced with permission
from [40]. (b) 3D rendering of macaque monkey brain showing regions involved in visual processing and TD attention. The areas include the first visual area (V1), fourth
visual area (V4), middle temporal area (MT), lateral intraparietal cortex (LIP), frontal eye fields (FEF), inferotemporal cortex (IT) and prefrontal cortex (PFC). The blue arrow
shows the pathway for TD signals investigated in the experiment shown in (a). Rendering of the brain was done using a macaque atlas data set [135] processed using the
Caret software [136].

We now consider feature-selective visual areas V1, V2,
V4, MT and IT. These visual processing areas drive BU
attentional signals and are targets for TD attentional
biasing signals. For accurate biasing of sensory signals,
specific local circuitry and the nature and size of receptive
fields in each of these areas must constrain the nature and
granularity of TD signals. Two types of feedback signals
from higher cortical regions or thalamus can influence the
visual processing areas [52]. One type of feedback signal
can flow from a higher visual processing area to a lower
one within the visual processing hierarchy (Figure 4b).
Another type of feedback signal can flow between an
attention area such as the FEF and a processing area such
as area V4. Figure 4a presents data from a study that
demonstrates a specific example of this type of feedback
signal [43]. The flow of TD attentional signals from the
PFC to area IT is another example of how TD attentional
signals from higher cortex can influence a feature-sensitive
sensory area [40]. Microstimulation in area IT results in
biases of object recognition [53], or even of face detection
when microstimulating face-selective sites within area IT
[54]. The striate and extrastriate cortices, therefore, are all
amenable to modulation by TD attention through feedback
connections from higher to lower visual areas.

In summary, TD signals can emerge from the PFC to
bias visual cortices through direct connections, such as
from the PFC to area IT, or possibly through the pulvinar
(see below). Similarly, there is evidence that a direct
connection from the FEF to area V4 might exist, which
further demonstrates the possible communication of TD
information from higher cortex to sensory areas. TD signals
from the PFC probably contain detailed information
about the target and this information might be used to bias
feature-selective areas of sensory cortex. The FEF and LIP,
in particular, might host spatial maps encoding the behavioral
relevance of visual space dependent on both BU and
TD factors.

Figure 4. Role of feedback from higher to lower cortical areas in mediating attention and perception. (a) Neuronal activity from visual area V4 was recorded in monkeys
simultaneously as the frontal eye field (FEF) was microstimulated (top panel). Histogram of neuronal activity in area V4 (bottom panel) in the control condition (black) and
the stimulation condition (red). Clear enhancement of the response is evident after FEF stimulation. This demonstrates the role of frontal areas in modulating responses in a
sensory visual area such as area V4. Reproduced with permission from [43]. (b) Visual area V5 in human subjects was stimulated with suprathreshold transcranial magnetic
stimulation (TMS) pulses, followed by subthreshold TMS stimulation of visual area V1 [137]. The top panel shows the TMS paradigm used. The bottom panel shows a plot
of subjective report by human subjects of phosphene perception resulting from TMS stimulation as a function of time lag between V1 and V5 stimulations (negative X
values correspond to area V1 stimulation before area V5). A Y-value of 1.0 indicates that the subject perceived that a phosphene was present and moving; a value of 2.0
indicates that a phosphene was present but the subject was uncertain of motion; and a value of 3.0 indicates that the subject could see the phosphene but it was stationary.
Results show that disruption of V1 activity between 5 and 45 ms after V5 stimulation results in the absence of motion, which thus demonstrates the importance of feedback
signals to early visual areas for the perception of motion. Reproduced with permission from [137].

Subcortical  influences  on  TD  attention 

Evidence suggesting that cortical areas have a strong
influence on attention was discussed in the previous section.
There are also several subcortical areas that play a
crucial role in defining and communicating attentional
signals (Figure 5a). It has been demonstrated that the
phenomenon of change blindness, in which changes to a
particular part of a visual scene go undetected, could be
eliminated in monkeys by placing an attention-grabbing
salient stimulus in the location where the blindness occurs
[55]. Interestingly, the same effects were also observed by
microstimulating the SC where receptive fields overlapped
with the region of blindness [55]. This demonstrated that
stimulation of the SC is equivalent to adding salience to a
region of space; in other words, the SC can strongly bias
attentional deployment. Another study demonstrated enhanced
behavioral performance on a perceptual task with
stimuli at locations corresponding to the site of stimulation
in the SC [56], mimicking the effects of a shift of attention.

In another study, microstimulation of the SC in monkeys
led to a bias in target selection decisions [57], which
demonstrates that the SC is also involved in target selection.
Conversely, inactivation of the SC led to target selection
errors [58]. The SC is therefore involved in both
attentional selection and saccadic behavior. One study
was able to elegantly dissociate saccade preparation signals
from attentional signals [59], which clarified the
ambiguity about the dual roles of the SC in oculomotor
behavior and attentional control. This study involved recording
from visual, visuomotor, and motor neurons in
monkey SC, and revealed that visuomotor neurons encode
the shift of covert attention (Figure 5b). It has also been
shown that the SC is involved in gating covert attention
signals used for making perceptual decisions by higher
cortical areas [60].

The SC connections to the FEF and LIP, together with
its role as an oculomotor structure, make it an important
structure in mediating covert and overt attention. Furthermore,
given its direct involvement in oculomotor behavior,
it has been suggested that the SC could host the final
priority map that guides attention based on a fusion of TD
and BU attentional signals received from cortex and elsewhere
[14].
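
The idea of a priority map that fuses BU and TD signals can be illustrated with a minimal sketch. This is not a model from the studies cited above: the weighted sum, the weights `w_bu` and `w_td`, and the winner-take-all readout are all assumptions chosen for simplicity, and the function names are hypothetical.

```python
def priority_map(bu, td, w_bu=0.5, w_td=0.5):
    """Combine a BU salience map and a TD relevance map (lists of rows).

    Assumption: fusion is a simple weighted sum per location.
    """
    if len(bu) != len(td) or any(len(r) != len(s) for r, s in zip(bu, td)):
        raise ValueError("BU and TD maps must have the same shape")
    return [[w_bu * b + w_td * t for b, t in zip(row_bu, row_td)]
            for row_bu, row_td in zip(bu, td)]


def select_target(priority):
    """Winner-take-all readout: (row, col) of the highest-priority location."""
    best = None
    for r, row in enumerate(priority):
        for c, value in enumerate(row):
            if best is None or value > best[0]:
                best = (value, r, c)
    if best is None:
        raise ValueError("priority map is empty")
    return best[1], best[2]


# Toy 2x2 visual field: a salient item at (0, 1), a task-relevant item at (1, 0).
bu = [[0.1, 0.9], [0.2, 0.3]]
td = [[0.0, 0.0], [1.0, 0.0]]

print(select_target(priority_map(bu, td)))            # TD relevance wins
print(select_target(priority_map(bu, td, 1.0, 0.0)))  # pure BU salience wins
```

In this toy example, the selected location moves from the salient item to the task-relevant item once the TD map is given non-zero weight, which is the qualitative behavior a fused priority map is meant to capture.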

Moving up the neuraxis to the thalamus, three important
nuclei associated with visual functions are found: the
LGN, the thalamic reticular nucleus (TRN) and the pulvinar
nucleus. The LGN and TRN modulate their signals in a
reciprocal manner (Figure 5c). When monkeys attended
inside the receptive field of the recorded TRN neuron, the
responses of this cell were reduced, whereas responses in
the LGN were enhanced [18]. This reciprocal response in
the TRN and LGN neurons was found in the initial phase of
the response to a visual stimulus. In a later phase, the TRN
response remained unchanged, but attention further enhanced
responses in LGN. These results suggest that (i)
the TRN serves as the initiator of modulation in the LGN
and (ii) attentional modulation begins at an early stage in
the LGN. The TRN therefore plays a crucial role in modulating
visual signals at a very early stage of processing.

Figure 5. Role of subcortical structures in attention. (a) Schematic drawing of circuitry that has been proposed to be involved in the generation of eye movements towards
locations of reward [78]. The cortex sends excitatory inputs to both the superior colliculus (SC) and the caudate nucleus (CD). The CD in turn inhibits the substantia nigra
pars reticulata (SNr), which then reduces its tonic inhibition on the SC. A disinhibited SC enables eye movements to be made. Reproduced with permission from [78]. (b)
Neuronal activity from a visuomotor cell in the SC. Monkeys were first presented with a spatial cue, followed by an oriented stimulus at the cued location. Monkeys then
made a saccade in the direction corresponding to the orientation of the stimulus. The orientation was always orthogonal to the location of the cue, and this dissociates
shifts of attention from saccadic behavior. The plot shows responses of a visuomotor SC cell, which shows significant activity in the attention shift period (between the
dashed lines) that occurs immediately after presentation of the cue, whereas purely motor cells in the deeper layers of the SC did not show such a response (data not
shown). Reproduced with permission from [59]. (c) Neuronal activity recorded from the thalamus in awake behaving monkeys. The monkeys were presented with a central
cue that instructed them to attend to one of two peripheral oriented bar stimuli, one inside the receptive field (RF) of a recorded neuron and one outside the RF. The top
shows the spike density of a magnocellular lateral geniculate nucleus (LGNm) neuron that exhibits an enhanced response when the monkey attends to a stimulus inside the
RF (ATTin condition) of the neuron compared to when the monkey attends to a stimulus outside the RF (ATTout condition). The bottom shows responses in the thalamic
reticular nucleus (TRN), which responds in a reciprocal manner to the LGNm neuron, exhibiting an enhanced response when attention is allocated to a stimulus outside the
RF. Therefore, the TRN might gate responses in the LGN. Reproduced with permission from [18].

The pulvinar is a hyperconnected nucleus of the thalamus
that has been implicated in the function of visual
attention based on anatomical [17,61-63], physiological
[64], lesion [65-68] and computational [69] studies. It
has been shown that a monkey's ability to suppress distractors
is diminished when the pulvinar is pharmacologically
inactivated via administration of muscimol [64].
Relay neurons have also been identified in the pulvinar
by microstimulating the SC and area MT while simultaneously
recording from cells in the pulvinar [70]. This
study adds to evidence of a subcortical route for visual
signals to reach higher cortex via the pulvinar. At the same
time, its bidirectional connections with higher cortical
areas make it a potentially important structure in mediating
TD signals. However, the pulvinar remains an understudied
nucleus, and further studies on this particular
brain nucleus are warranted.

Subcortical structures, therefore, both modulate signals
in areas encoding BU and TD information, such as the LIP
and FEF, and receive TD information from higher cortical
areas, directly or possibly through the pulvinar. The SC
itself is believed to host a priority map, but this priority
map might have closer correspondence to representations
needed for motor decisions, including oculomotor behavior
and head movements. Thalamic nuclei, including the LGN
and TRN, modulate visual signals early on, before they
reach cortex, and the pulvinar might be a key relay in
communicating attentional signals from one region to
another. Subcortical structures are also heavily involved in,
and influenced by, reward and emotion, as discussed in the
following section.

One emerging theme is that disparate modes of processing
might exist in the different brain regions identified
above. Areas such as the LIP and FEF, and subcortical
structures such as the SC, might normally operate in a
feature-agnostic mode, encoding salience and facilitating
or inhibiting regions of visual space according to behavioral
goals, but without regard to detailed visual features. (This
does not, however, preclude these areas from developing
feature selectivity through operant training [16], conditioning
[71] or task demands [72].) Conversely, visual
cortices (areas V1, V2, V4), the IT and the PFC might be
operating in a feature-committed mode, modulating
responses depending on the exact visual features that give
rise to BU salience and/or TD relevance. The pulvinar
might then serve as a bidirectional translator, converting
fine-grained, feature-committed TD signals to coarser,
feature-agnostic TD signals and vice versa. This dichotomy
between feature-agnostic and feature-committed TD signals
gives rise to interesting hypotheses about possible
mechanisms by which TD attention exerts its influence on
neural responses in sensory cortex, and thus affects attentional
allocation and gaze behavior.

The  role  of  reward  and  emotion  in  TD  attention 

Until recently, studies of visual attention tended to avoid
non-visual aspects of cortical and subcortical neuronal
responses to manipulations of attention.
This has begun to change with a small number of
psychophysical and electrophysiological studies that have
explored the interplay between reward and attention.

To investigate the role of reward in modulating attention-related
responses in the LIP, stimulus selection has
been dissociated from motor selection in monkeys [71].
With training, LIP neurons exhibit a strong sustained bias
toward the location of a conditioned stimulus, even when a
saccade in the opposite direction is required to reveal the
outcome of the trial. This suggests that LIP neurons encode
'the value of information' [23] and prioritize spatial locations
based on this value.

Studies using operant conditioning paradigms demonstrate
effects related to improvements in the volitional TD
process. However, learning is also the primary method for
augmenting the mandatory TD process. It has been shown
that, with training, the FEF develops systematic biases, akin
to a mandatory TD signal, thereby facilitating shifts of
attention in the direction of the learned feature when it is
present at any location [73,74]. More recently, a similar tendency was
found in humans performing a visual search task in which
the target changed on every trial, which therefore precluded
subjects from simply learning a limited set of target
features [75]. Subjects' performance improved, demonstrating
an improved ability to quickly extract information
from a brief preview of the target before each trial, and to
then use this information to shape TD signals and guide
attention. Learning and reward paradigms can therefore
influence the ability both to generate TD biasing signals (i.e.
volitional TD process) and to introduce systematic biases (i.e.
mandatory TD process).

Reward plays an important role in modulating attentional
signals, and the basal ganglia, which consist of
dopaminergic nuclei in the substantia nigra pars reticulata
(SNr), the caudate and the putamen, are essential in
encoding reward signals [76]. The basal ganglia are integrally
connected to the oculomotor system through the
connection of the SC to the SNr [77]. Reward signals (TD)
from frontal cortices are transmitted to the caudate, which
then inhibits the SNr; this in turn pauses the tonic
inhibition from the SNr to the SC, releasing the SC from
inhibition and enabling saccades [78]. This follows a more general
scheme in the CNS in which the basal ganglia circuit
continually inhibits movement of all limbs until an explicit
command to make a motor movement is received from
cortical or subcortical regions. Furthermore, it is also
possible that reward plays a strong role in influencing a
subcortical salience map that can cause instant oculomotor
reflexes.

A  recent  study  has  shed  new  light  on  the  SNr  to  SC 
connection  by  demonstrating  that  SNr  fibers  connect  not 
only  to  excitatory  neurons  in  the  SC,  but  also  to  local 
GABAergic  neurons  in  the  intermediate  layers  of  the  SC 
[79].  Therefore,  the  SNr  is  involved  in  shaping  the  balance 
of  inhibition  and  excitation  in  the  local  SC  circuit.  SC 
involvement  in  attentional  selection  and  the  strong  role 
of  the  SNr  in  reward  render  the  SNr-SC  connection  an 
important one because in most studies, especially physiological
studies in monkeys, paradigms are based on the
elements of operant conditioning and reinforcement learning
with a crucial role for reward (see [78,80] for more
detailed  discussions). 

Sensory  processing  is  also  amenable  to  modulation  by 
brain  regions  encoding  emotions.  In  particular,  it  is  known 
that  the  amygdala  has  reciprocal  connections  with  both 
early  and  late  visual  areas  and  can  thus  give  priority, 
through  modulation,  to  stimuli  of  ecological  relevance 
[81]. Using a combination of functional magnetic resonance
imaging (fMRI) and a study of lesion patients, it was found
that visual areas such as the fusiform gyrus receive input
from the amygdala and exhibit enhanced responses to
affective stimuli [82]. Such modulation by emotion matches
response enhancement observed through attentional allocation.
Furthermore, it has been shown that emotional and
attentional modulations can act independently, as observed
in patients with lesions of the amygdala, whose
fusiform cortex exhibited responses modulated by attention
but not emotion [82]. Affective stimuli can therefore
impinge on sensory signals independently of attention;
however, the very enhancement due to emotional valence
might render the stimuli salient and thus draw more
attention. Attention and emotion might thus act independently
on the sensory signals and the behavioral relevance
of these sensory inputs might be determined by the cumulative
effects of both attention and emotion.

One proposal for neural mechanisms and regions involved
in fusion of affective inputs with purely visual
aspects driving attention has recently been suggested
based on a search task in human subjects using fMRI
[83]. The frontoparietal spatial attention network, consisting
of the superior parietal lobule (SPL), the inferior
parietal lobule (IPL) and the FEF, was activated when
the cue was purely spatial. However, when the cue contained
both spatial and emotional information, limbic and
subcortical structures including the posterior cingulate
cortex (PCC), the amygdala and the orbitofrontal cortex
were activated, in addition to the frontoparietal network.
This study also found selectivity in the PCC for responding
only to cues that had emotional valence [83]. These results
suggest that the cingulate gyrus, which receives inputs
from the amygdala and sends outputs to the frontoparietal
network, might serve as the gateway for affective inputs to
fuse with spatial biasing signals. This gives rise to a TD
salience map in the frontoparietal network, complete with
affective and spatial priority information.

Although  evidence  remains  limited,  a  number  of  studies 
have  demonstrated  links  between  the  attentional  network 
and  reward  and  emotional  centers.  Such  connections  must 
be  taken  into  account  when  considering  TD  networks, 
because most experimental paradigms involving TD attention
to date have used reward and/or emotional valence to
train  and  motivate  human  or  animal  participants. 

The  role  of  oscillatory  activity  and  neuromodulation  in 
TD  attention 

It  has  recently  been  suggested  that  synchronous  activity  (in 
the  gamma  range,  50-80  Hz)  between  cortical  regions  might 
serve  as  the  basis  for  attentional  facilitation  and  cortical 
computations  [84].  In  this  proposal,  neuronal  populations 
representing inputs and decision centers all consist of rhythmically
active neural ensembles with distinct excitatory and
inhibitory  phases.  Inhibitory  interneurons  in  each  ensemble 
rhythmically  inhibit  excitatory  pyramidal  neurons,  thereby 
establishing  a  rhythm.  Two  neural  ensembles  can  then 
synchronize through phase-locking. This gives rise to a
winner-take-all mechanism between two competing inputs
feeding into a single higher cortical decision area, through
synchronization between the higher area and one selected
input. Synchrony between the input and higher areas can be
established in a TD or BU manner. In the TD case, a
higher cortical region might establish gamma synchrony
with a lower sensory area by phase-locking.

Data  from  several  studies  demonstrate  that  gamma 
oscillations  in  the  cortex  are  correlated  with  attention 
[45,85].  Disparate  brain  regions  might  synchronize  their 
activity  in  the  gamma  band  when  an  animal  is  attending  to 
a  particular  stimulus.  A  specific  example  of  this  type  of 
coupling  is  that  observed  between  the  FEF  and  area  V4  in 
monkeys. When the animals attended to a stimulus, coupling
through gamma oscillations was observed between
neurons in the FEF and V4 [45]. Oscillations in
lower frequency bands, such as the alpha and delta bands,
have also been implicated in sensory selection [86]. Specifically,
in the presence of rhythmic stimuli, delta band
oscillations in visual cortex entrain to the rhythm of the
stimuli [86]. In doing so, periods of excitability in sensory
cortex are aligned with events in the attended stream. In
this manner, behaviorally relevant events in the input can
be detected more reliably. The same study also showed that
the phase of the low-frequency band can modulate amplitudes
in higher-frequency bands, such as the gamma band
essential for attention. Thus, oscillations in both the gamma
and lower-frequency bands are essential neural
mechanisms for sensory selection and attention.

The neurochemical basis for attention further supports
the notion that synchrony is a possible mechanism for TD
attention. Several studies have described acetylcholine
(ACh) as the major neurotransmitter involved in mediating
attention at the neuronal level [87]. Using pharmacological
manipulations, it was found that attentional
modulation in area V1 could be enhanced by low doses
of ACh [88]. Furthermore, injection of a muscarinic ACh
receptor (mAChR) antagonist eliminated such facilitation,
but a nicotinic ACh receptor (nAChR) antagonist did not.
This demonstrates that ACh acts through mAChRs to
modulate attention. Such modulation might enhance processing
in sensory areas, a property of TD attention. It has
been demonstrated that pharmacological modulation of
glutamatergic transmission in the PFC causes an increase
in cholinergic release in the PPC [89]. Given the evidence
from the studies discussed above, it is reasonable to hypothesize
that one neurochemical process by which the
PFC could be involved in TD biasing is modulation of ACh
release in sensory areas.

One  method  that  has  been  suggested  for  achieving 
gamma  synchrony  is  the  disinhibition  of  pyramidal  cells 
from  inhibitory  interneuron  activity  through  cholinergic 
inputs  [90].  This  suggests  that  the  cholinergic  system 
might  also  give  rise  to  the  gamma  synchrony  correlated 
with  attention  [84] .  Taken  together,  this  evidence  suggests 
that  one  possible  mechanism  involved  in  the  selection  of 
relevant  sensory  stimuli  is  via  modulation  of  ACh  by 
higher  cortical  regions,  such  as  the  PFC,  onto  sensory 
cortical  regions,  which  in  turn  would  induce  more  powerful 
gamma  synchronies  between  sensory  and  higher  cortical 
regions.  However,  it  is  currently  unclear  whether  gamma 
synchrony  modulation  or  firing  rate  modulation  is  the  core 
mechanism  involved  in  TD  attention.  This  question  was 
addressed  using  a  biophysically  realistic  computational 
model  of  a  single  layer  of  visual  cortex  receiving  attentional 
inputs  [91].  The  model  of  the  visual  cortex  consisted  of 
neurons  with  glutamatergic  synapses.  These  synapses 
were  modeled  with  two  types  of  glutamate  receptors, 
AMPA  and  NMDA.  Modulation  of  the  ratio  of  AMPA  to 
NMDA  receptor  conductance  gave  rise  to  both  firing  rate 
and  gamma  synchrony  modulation  in  an  independent 
manner.  This  suggests  that  TD  attention  might  be  able 
to  regulate  these  two  systems  in  an  independent  manner  to 
set  or  modify  gain  in  sensory  areas.  Despite  the  paucity  of 
conclusive  empirical  evidence,  neural  gamma  synchrony 
and  the  concept  of  glutamatergic  modulation  in  PFC  giving 
rise to ACh modulation in sensory areas provide a compelling
potential neural mechanism for TD attention.

Computational  modeling 

Physiological studies have guided several theoretical and
computational models of attention. Building on the influential
feature integration theory [2], guided search theory
hypothesizes that massively parallel pre-attentive processes
can be guided by TD biasing for features and locations
[92]. This theory brings TD elements to a basic BU model of
attention [93], which computes individual features at different
scales and then combines these features to form a
saliency map. A unifying normalization model of attention
has  recently  been  proposed  and  accounts  for  many  effects  of 
TD  attention  onto  visual  areas  (Figure  6a)  [37].  In  this 
model, the neuronal population response of sensory cortex

Figure 6. Computational modeling of TD attention. (a) Model of attention processing inspired by the normalization model of attention [37]. A visual stimulus can be
processed  by  early  visual  processing  stages  and  this  gives  rise  to  stimulus  drive.  Stimulus  drive  can  then  be  combined  with  an  attention  field  that  can  provide  TD 
modulation  over  space.  Although  the  model  does  not  specify  how  the  attention  field  is  formed,  we  hypothesize  that  TD  influences  are  responsible  for  this.  Note  how  some 
TD  signals  might  directly  modulate  or  shape  the  stimulus  drive  (e.g.  by  shifting  receptive  fields  or  affecting  orientation  preference).  After  combination  with  the  attentional 
field,  responses  undergo  divisive  normalization  and  contrast  gain  control  before  outputting  the  final  response.  Figure  adapted  with  permission  from  [37].  Example  input 
frame  provided  by  Daniel  Simons  [138].  (b)  A  model  based  on  related  ideas  that  provides  a  computer  implementation  applied  to  the  analysis  of  human  gaze  behavior  while 
engaged  in  complex  naturalistic  tasks  (e.g.  driving)  [97].  A  task-dependent  learner  component  builds,  during  a  training  phase,  associations  between  distinct  coarse  types  of 
scenes  and  observed  eye  movements  (e.g.  drivers  tend  to  look  to  the  right  when  the  road  turns  right).  During  testing,  exposure  to  similar  scenes  gives  rise  to  a  TD  salience 
map  (similar  to  the  attention  field  in  (a)),  which  is  further  combined  with  a  BU  salience  map  (similar  to  the  stimulus  drive)  to  give  rise  to  the  final  BU*TD  priority  map  that 
guides  attention.  Blue  diamonds  represent  the  peak  location  for  each  map  and  orange  circles  represent  the  current  eye  position  of  a  human  player.  Figure  adapted  with 
permission  from  [97]. 


to  a  stimulus  is  determined  by  a  competitive  normalization 
process  that  combines  stimulus  drive,  suppressive  drive 
and  an  attention  field.  Although  this  model  successfully 
captures  a  wide  range  of  single-unit  observations,  it  does 
not elucidate how the attention field is formed. This concept
is related to the idea of a task-relevance map, a
topographical  map  of  visual  space  that  might  highlight 
locations  or  features  of  current  behavioral  relevance  and 
might  then  act  as  a  mask  or  filter  over  the  BU  salience  map 
[94]. The task-relevance map might be populated by combining
information about desired features (e.g. look for red
items),  cued  spatial  locations  (e.g.  instructions  that  the 
target  is  to  the  right),  scene  gist  and  context  (e.g.  when 
looking  for  a  stapler  in  an  office,  focus  first  on  desktops), 
short-term  memory  of  objects  and  features  at  previously 
visited locations, and TD expectations arising from reasoning
about what has been discovered so far in light of the
task  (e.g.  if  searching  for  a  computer  mouse,  finding  a 
keyboard  and  reasoning  that  the  owner  of  the  machine 
might  be  right-handed  might  bias  attention  to  the  right  of 
the keyboard) [94]. Interestingly, recent human neuroimaging
data provide direct support for such a task-relevance or
TD salience map, possibly located in the intraparietal
sulcus (IPS). Indeed, it has been shown that the latter
combines, into a single topographic (or, at least, lateralized)
map, information about both TD-relevant locations
and TD-relevant features [95], and emotional or motivational
value of a cued target [83,96]. In a biologically
inspired large-scale computer vision implementation, a
similar combination of a TD attention field and BU saliency
map was used to predict eye movements of humans engaged
in complex tasks (e.g. combat flying or first-person
exploration video games) [97]. Given the complexity of
these tasks and the multiple interacting TD goals involved,
this model did not attempt to fully analyze and recognize all
objects in scenes and to assess them in light of the task
goals. Instead, the TD map was obtained from learned
associations between particular types of scenes (summarized
by a simple vector of features capturing their gist) and
the locations that humans looked at when engaged in the
same task and exposed to similar visual scenes (Figure 6b).
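
As a rough illustration of the normalization idea discussed in this section, the following Python sketch multiplies a stimulus drive by an attention field and divides the result by a pooled suppressive drive. The function name, the global-mean pooling and the constant sigma are illustrative assumptions and are not taken from [37], whose model pools suppression over spatial and feature neighborhoods.

```python
def normalization_response(stimulus_drive, attention_field, sigma=0.1):
    """Toy divisive normalization over a small neuronal population.

    stimulus_drive and attention_field hold one value per neuron.
    sigma is a hypothetical constant that keeps the division stable.
    """
    if len(stimulus_drive) != len(attention_field):
        raise ValueError("drive and attention field must have equal length")
    # Attention scales the stimulus drive multiplicatively.
    excitatory = [e * a for e, a in zip(stimulus_drive, attention_field)]
    # Assumption: the suppressive drive is the mean of the whole population.
    suppressive = sum(excitatory) / len(excitatory)
    return [x / (suppressive + sigma) for x in excitatory]


stimulus = [1.0, 1.0]                                      # two equal stimuli
neutral = normalization_response(stimulus, [1.0, 1.0])     # no attentional bias
attended = normalization_response(stimulus, [2.0, 1.0])    # attend to the first
```

In this toy setting, attending to the first stimulus raises its response and lowers the response to the unattended one, because both share the same suppressive pool.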

At  one  extreme,  TD  attention  signals  might  just  consist 
of  a  single  bit  of  information  -  to  ‘enhance’  or  not  -  with 
target  visual  areas  interpreting  it  in  different  manners 
depending  on  context  and  on  visual  inputs.  One  advantage 
of  such  a  solution  is  the  low  TD  communication  bandwidth, 
but an obvious drawback is the inflexibility of signal content.
At the other extreme, the brain areas where TD
signals originate might address every sensory neuron
individually and explicitly modulate the neuron’s activity;
for example, increasing gain by some specific amount,
sharpening tuning, and increasing baseline activity. Such
a scheme would afford maximal flexibility, but at the cost of
both  enormous  TD  communication  bandwidth  and  high 
computational  requirements  in  areas  where  TD  signals 
originate,  to  compute  the  exact  values  for  all  these  signals. 
The  true  nature  of  TD  signals  is  likely  to  lie  between  these 
two extremes, as further elaborated below.

The Guided Search model lies towards the low-bandwidth
end of the spectrum, with TD signals imposing
spatial attention modulation over coarse regions of visual
space and coarse visual features (e.g. a single TD attention
weight for each of red, green, blue or yellow colors, or steep,
shallow, left or right orientations) [92]. Two recent studies
have  refined  this  proposal.  First,  in  human  eye-tracking 
experiments  it  has  been  shown  that  attention  and  gaze  can 
effectively  be  guided  towards  rather  fine  sub-bands  of  basic 
visual  features,  such  as  mid-luminance  items  among  low- 
and  high-luminance  items,  and  similarly  for  size  and  color 
saturation  [98].  Furthermore,  these  results  have  been 
formalized  with  a  signal-to-noise  ratio  (SNR)-maximizing 
model  for  feature  search,  whereby  the  TD  gain  applied  to 
each sensory neuron is proportional to its ability to distinguish
the target of behavioral interest from background
clutter [99]. Taken together, these two studies suggest that
the  bandwidth  or  granularity  of  TD  signals  is  unlikely  to  be 
extremely  low,  but  rather  might  consist  of  at  least  a  few 
bits  for  each  fine-grained  feature  sub-band,  sufficient  to 
convey  optimal  biases  from  the  top  down.  The  bandwidth 
(and  number  of  descending  connections)  might  be  higher  if 
different  biases  can  be  communicated  to  different  locations 
of sensory space. At the high extreme, the aforementioned
normalization model of attention assumes a highly detailed
attention field over space and features [37], implying
high-bandwidth TD signals.
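
The SNR-based gain rule mentioned above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: each channel's SNR is taken as the ratio of its mean response to the target over its mean response to distractors, and gains are rescaled to average one. The function names and example numbers are hypothetical and are not taken from [99].

```python
def snr_gains(target_resp, distractor_resp, eps=1e-9):
    """Return one top-down gain per feature channel, proportional to its SNR."""
    # Assumed SNR definition: target response divided by distractor response.
    snr = [t / (d + eps) for t, d in zip(target_resp, distractor_resp)]
    mean_snr = sum(snr) / len(snr)
    # Rescale so that the gains average to one across channels.
    return [s / mean_snr for s in snr]


def biased_salience(responses, gains):
    """Weighted sum of channel responses under the given gains."""
    return sum(g * r for g, r in zip(gains, responses))


# Hypothetical mean responses of three channels to target and distractors.
target = [0.9, 0.5, 0.1]
distractor = [0.3, 0.5, 0.4]
gains = snr_gains(target, distractor)

ratio_unbiased = sum(target) / sum(distractor)
ratio_biased = biased_salience(target, gains) / biased_salience(distractor, gains)
```

Channels that respond more to the target than to the clutter receive larger gains, so the target stands out more from distractors after biasing than before.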

Beyond the nature and bandwidth of information conveyed
from the top down, computational models have
proposed  a  number  of  connectivity  styles  that  might  be 
embodied  in  the  biological  reality  of  TD  connections.  On 
the  one  hand,  one  model  has  identified  a  specific  dedicated 
structure  (the  pulvinar)  as  a  hub  or  relay  for  TD  signals  to 
reach  target  visual  areas  [69] .  On  the  other  hand,  a  more 
distributed  model  suggests  that  TD  signals  are  embedded 
within  the  visual  areas  themselves  [100].  In  this  model,  a 
stimulus  is  selected  at  the  top  level  based  on  an  initial 
sweep  of  feed-forward  information.  The  spatial  selection 
signals  then  propagate  back  and  tune  lower  levels  of  the 
(cortical)  visual  processing  hierarchy  through  a  cascade  of 
winner-take-all mechanisms. This view involves retrograde
propagation of signals over the processing hierarchy
as  opposed  to  direct  connections  (or  through  one  or  a  few 
relays)  between  top  and  bottom.  A  number  of  models  also 
give  specific  roles  to  direct  or  indirect  connections  among 
different  levels  of  the  hierarchy,  for  example  between  the 
PFC,  FEF,  TE  and  V4  [101].  These  models  are  important 
because  they  develop  hypotheses  for  the  meaning  of  large- 
scale  connectivity  between  brain  areas,  and  these  are 
beginning  to  be  explicitly  tested  in  biological  networks 
using  graph-theoretic  analyses  [102].  Nevertheless,  there 
is  a  clear  lack  of  specific  computational  (and  experimental) 
studies  that  systematically  investigate  the  granularity, 
bandwidth  and  specific  wiring  of  TD  signals. 

Finally,  computational  theories  and  models  have 
started to provide hypotheses for the meaning of TD signals.
For example, models based on feedback connections
from  higher  cortical  areas  have  been  placed  in  a  Bayesian 
framework,  with  the  suggestion  of  a  generative  model  that 
produces  a  hypothesis  about  a  percept  (the  prior),  then 
combines  this  with  evidence  from  BU  information  to  make 


Box  1.  Outstanding  questions 


•  What  is  the  bandwidth  of  the  TD  signal  transmitted  from  one 
region  of  the  brain  to  the  next?  Figure  I  illustrates  the  two  types  of 
signal.  A  narrow-bandwidth  signal  (yellow  arrow)  defines  single 
weights  for  individual  features,  whereas  a  broadband  signal  (blue 
arrow)  defines  the  distribution  of  gain  and  tuning  over  the  feature 
space,  as  well  as  the  interactions  within  a  feature  dimension. 

•  Are  TD  signals  relayed  to  visual  areas  through  a  central  hub  (e.g. 
the  pulvinar)  or  does  a  more  distributed  mechanism  reflect  the 
reality  of  communication  of  TD  signals  to  sensory  areas? 

•  What  is  the  representation  or  encoding  of  TD  signals?  In  concrete 
terms,  how  are  behavioral  goals  represented  and  communicated 
to  sensory  neurons  that  are  tuned  to  specific  features? 

•  What  (if  any)  computations  take  place  subcortically,  independent 
of  the  cortex,  that  would  influence  attention  modulation  of 
sensory perception?

Figure I. Narrow-band versus broad-band TD biasing signals. The TD biasing
signals  transmitted  from  one  area  'A'  of  the  brain  to  another  area  'B'  can  be 
either  narrow-band  (yellow  arrow)  or  broad-band  (blue  arrow)  in  nature. 
Narrow-band  signals  consist  of  a  small  set  of  weights  that  bias  feature 
preferences  in  a  coarse  manner.  The  bar  graph  shows  a  signal  that  applies  a 
higher  gain  to  neurons  tuned  to  red  rather  than  blue  in  the  color  feature 
dimension,  neurons  tuned  to  shallow  rather  than  steep  orientations,  and 
neurons  tuned  to  brighter  rather  than  darker  stimuli.  Broad-band  biasing 
signals  (bottom)  contain  a  greater  amount  of  information  and  might  facilitate 
biasing  of  features  in  a  detailed  manner,  weighing  gain,  tuning  and  feature 
interactions  independently.  Rather  than  simply  setting  a  weight  along  a  feature 
dimension,  as  is  the  case  in  the  narrow-band  example,  broad-band  TD  signals 
might  set  a  biasing  profile  along  the  feature  dimension,  as  shown  in  the 
example  graphs.  Green  curves  show  a  biasing  profile  for  gains  of  neurons 
along  a  feature  dimension.  Blue  curves  show  a  biasing  profile  of  tuning  of 
neurons;  a  peak  here  would  indicate  a  bias  or  shift  of  tuning  of  neurons  for  the 
particular  feature  value.  The  interaction  triangles  on  the  right  show  biases  for 
feature  interactions.  For  example,  along  the  hue  dimension,  there  are  two  hot 
spots,  one  indicating  a  preference  for  simultaneous  occurrence  of  yellow  and 
red  hues  and  another  indicating  a  preference  for  red  and  blue  hues. 


a final decision on the percept [103]. This approach has
been formalized in a hierarchical Bayesian framework
[104]. Although these ideas have so far been explored more
in the context of the mandatory TD process, they can also
be placed in the context of the volitional TD process.
Volitional TD control could then be understood as updating,
biasing or disambiguating the prior based on high-level
tasks, contextual cues or behavioral goals. In computer-vision
models using these principles, it has indeed
been  shown  that  TD  attention  provides  great  benefit  over 
pure  BU  processing  [105].  For  example,  TD  information 
can  more  effectively  guide  visual  search  for  specific  objects 
in  natural  scenes  (e.g.  pedestrians  in  street  scenes)  by 
limiting  the  search  to  spatial  locations  of  high  prior  or 
posterior  probabilities  [106-108].  Although  computational 
models  have  made  some  headway  in  both  incorporating 
experimental  data  and  generating  predictions  to  guide 
further experiments, much remains to be done both experimentally
and theoretically to unravel the mechanisms by
which TD attention influences BU processing
(Box 1).
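
The Bayesian reading of TD signals described above can be illustrated with a small Python sketch: a TD prior over candidate locations is multiplied by BU evidence and renormalized, and search is then restricted to locations whose posterior exceeds a threshold. The function names, the threshold and all numbers are hypothetical; this is a sketch of the general principle and not the method of [103-108].

```python
def posterior_over_locations(prior, likelihood):
    """Combine a top-down prior with bottom-up evidence using Bayes' rule."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("prior and likelihood have no overlapping support")
    return [u / total for u in unnormalized]


def locations_to_search(posterior, threshold):
    """Keep only the locations whose posterior probability is high enough."""
    return [i for i, p in enumerate(posterior) if p >= threshold]


# Hypothetical example with four candidate locations in a street scene.
prior = [0.05, 0.15, 0.60, 0.20]       # TD expectation (e.g. sidewalk region)
likelihood = [0.90, 0.20, 0.50, 0.10]  # BU evidence for a pedestrian-like item
posterior = posterior_over_locations(prior, likelihood)
candidates = locations_to_search(posterior, threshold=0.1)
```

With these numbers, the third location dominates the posterior even though the first has the strongest BU evidence, which shows how a prior can redirect search.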

Conclusion 

Attention  modulates  sensory  signals  early  in  the  process, 
exerting  its  influence  on  the  SC  and  the  thalamus  before 
further  modulating  signals  in  cortex.  The  cumulative 
effects  of  this  modulation  based  on  both  TD  and  BU 
influences  might  be  represented  by  a  priority  map  over 
visual  space.  Although  there  is  some  debate  about  the 
exact  locus  of  the  priority  map,  it  is  clear  that  the  LIP, 
FEF  and  SC  exhibit  properties  that  are  compatible  with 
the existence of a spatial map encoding behavioral relevance
of spatial locations. These three regions might jointly
compute or host such a map that is agnostic to the
features  that  caused  the  priority.  Thus,  the  map  fuses  both 
BU  and  TD  influences  and  drives  motor  output. 

Higher  cortical  areas  such  as  the  PFC  send  detailed  TD 
signals  to  sensory  areas  for  biasing  of  spatial  and  non- 
spatial  features.  Such  signals  fuse  together  with  reward- 
related  and  emotional  signals  to  form  the  TD  influence  on 
attention,  which  might  be  reflected  in  the  priority  map. 
Subcortical  regions,  through  their  close  connection  to  the 
reward  systems  in  the  brain  and  their  coupling  with  motor 
systems,  exert  strong  influences  on  attentional  signals,  in 
addition  to  being  major  targets  of  attentional  modulation 
for motor output. Feedback connections are both pervasive
and crucial for the transmission of biasing signals emanating
from higher brain regions, especially the frontal
cortices that are involved in working memory processes
and send descending reward signals. Computational studies
highlight the important constraints on the nature,
granularity, bandwidth and connectivity style of TD connections.
There is a pressing need to build models that take
into account physiological data, particularly from microstimulation
and lesion studies, which could help to determine
the contributions of specific areas to the
computations  necessary  for  attentional  guidance. 

Although  the  exact  mechanisms  of  TD  attention  have 
yet  to  be  completely  delineated,  there  are  sufficient  data 
available  to  demonstrate  that  attention  is  mediated  by  the 
merging  of  TD  and  BU  information.  As  William  James 
eloquently stated, ‘The attentive process, therefore, at its
maximum may be physiologically symbolized by a brain-cell
played on in two ways, from without and from within’
[109].

Abstract

We  introduce  a  saliency  model  based  on  two  key  ideas. 
The  first  one  is  considering  local  and  global  image  patch 
rarities  as  two  complementary  processes.  The  second  one 
is  based  on  our  observation  that  for  different  images,  one 
of  the  RGB  and  Lab  color  spaces  outperforms  the  other  in 
saliency  detection.  We  propose  a  framework  that  measures 
patch  rarities  in  each  color  space  and  combines  them  in 
a final map. For each color channel, first, the input image
is partitioned into non-overlapping patches and then
each  patch  is  represented  by  a  vector  of  coefficients  that 
linearly  reconstruct  it  from  a  learned  dictionary  of  patches 
from  natural  scenes.  Next,  two  measures  of  saliency  (Local 
and  Global)  are  calculated  and  fused  to  indicate  saliency 
of  each  patch.  Local  saliency  is  distinctiveness  of  a  patch 
from  its  surrounding  patches.  Global  saliency  is  the  inverse 
of  a  patch’s  probability  of  happening  over  the  entire  image. 
The  final  saliency  map  is  built  by  normalizing  and  fusing 
local  and  global  saliency  maps  of  all  channels  from  both 
color  systems.  Extensive  evaluation  over  four  benchmark 
eye-tracking  datasets  shows  the  significant  advantage  of  our 
approach  over  10  state-of-the-art  saliency  models. 
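
The local and global rarity measures summarized in the abstract can be sketched as follows. This is a simplified illustration under loudly stated assumptions: each patch is represented by a short raw feature vector instead of sparse coefficients over a learned dictionary, global probability is estimated from a coarse histogram of quantized vectors, and the two normalized maps are fused by averaging. All function names and parameters are hypothetical.

```python
import math
from collections import Counter


def local_rarity(grid):
    """Mean Euclidean distance of each patch vector to its grid neighbors."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = [math.dist(grid[r][c], grid[rr][cc])
                     for rr in range(max(0, r - 1), min(rows, r + 2))
                     for cc in range(max(0, c - 1), min(cols, c + 2))
                     if (rr, cc) != (r, c)]
            out[r][c] = sum(dists) / len(dists) if dists else 0.0
    return out


def global_rarity(grid, step=0.5):
    """Inverse of each patch's empirical probability over the whole image."""
    keys = [[tuple(round(v / step) for v in patch) for patch in row]
            for row in grid]
    counts = Counter(k for row in keys for k in row)
    total = sum(counts.values())
    return [[total / counts[k] for k in row] for row in keys]


def normalize(m):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def fuse(a, b):
    """Average two maps element-wise (an assumed fusion rule)."""
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# 3x3 grid of patch vectors: a single odd patch in the center.
grid = [[(0.0, 0.0)] * 3 for _ in range(3)]
grid[1][1] = (1.0, 1.0)
saliency = fuse(normalize(local_rarity(grid)), normalize(global_rarity(grid)))
```

In this toy grid the center patch differs from all of its neighbors and is also the rarest patch overall, so both measures agree and the fused map peaks there.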

1.  Introduction 

The human visual system has to process an enormous
amount of incoming information (10^ bit/s) from the
retina. Similarly, in computer vision, many systems suffer
from the high computational complexity of inputs, especially
when these systems are supposed to work in real
time. Visual saliency is a concept that offers efficient solutions
for both biological and artificial vision systems. It is
basically a process that detects scene regions different from
their surroundings (often referred to as bottom-up saliency).
Then, higher cognitive and usually more complex operations
are focused only on the selected areas.

Recently, modeling visual saliency has raised much interest
in theory and applications (see [47] for a review). For
example  in  computer  vision,  it  has  been  used  for  image  and 
video  compression  [49],  image  segmentation,  and  object 


Laurent Itti

USC

itti@usc.edu


Image  Human  Lab  RGB  Combined 


Figure  1.  One  color  system  does  not  work  for  all  images.  Top  (Bottom): 
Sample  images  where  our  model  is  able  to  detect  the  outliers  in  CIE  Lab 
(RGB)  color  space.  For  some  images  both  color  spaces  work  equally  well. 
Last  column  shows  combined  maps  from  both  color  spaces.  Images  are 
taken  from  the  TORONTO  dataset  [14]. 

recognition [52]. In computer graphics, detecting salient regions
has been employed for content-aware image cropping,
photo collage [50], and stylization of images [53]. Saliency
computation also has applications in other areas such as advertisement
design [51] and visual prosthetics [48]. Our
focus  in  this  paper  is  proposing  a  new  and  more  predictive 
(with  respect  to  human  eye  tracking  data)  model  of  bottom- 
up  visual  saliency  by  integrating  local  and  global  saliency 
detection  in  both  RGB  and  Lab  color  spaces  (see  Fig.  1). 

Related  works  on  saliency  modeling.  A  majority 
of  computational  models  of  attention  follow  the  structure 
adapted  from  the  Feature  Integration  Theory  (FIT)  [15]  and 
the Guided Search model [1]. Koch and Ullman [19] proposed
a computational architecture for this theory and Itti et
al. [4] were among the first ones to fully implement and
maintain it. The main idea here is to compute saliency in
each of several features (e.g., color, intensity, orientation;
saliency is then the relative difference between a region and
its surrounding) in parallel, and to fuse them in a scalar map
called the “saliency map”. Le Meur et al. [18] adapted
the Koch-Ullman model to include features of contrast
sensitivity, perceptual decomposition, visual masking, and
center-surround interactions. Some models have added features
such as symmetry [20], texture contrast [36], curvedness
[21], or motion [41] to the basic structure.

In addition to the cognitive models mentioned above, several probabilistic
models of visual saliency have been developed over the past years. In these
models, a set of statistics or probability distributions is computed from
either the current scene or a set of natural scenes, over space, time, or
both. Itti and Baldi [10] defined surprising stimuli as those which
significantly change the beliefs of an observer, measured as the
Kullback-Leibler (KL) distance between posterior and prior beliefs. Harel et
al. [7] used graph algorithms and a measure of dissimilarity to achieve
efficient saliency computation with their Graph Based Visual Saliency (GBVS)
model. Torralba et al.'s contextual guidance model [26] consolidates low-level
salience and scene context when guiding search. Areas of high salience within
a selected contextual region are given higher weights on an activation map
than those that fall outside the selected contextual region. Some Bayesian
models formulate visual search and derive a measure of bottom-up saliency as a
by-product. For example, Zhang et al.'s model [12], Saliency Using Natural
statistics (SUN), combines top-down and bottom-up information to guide eye
movements during real-world object search tasks. However, unlike Torralba et
al.'s model, SUN implements target features as the top-down component. Gao and
Vasconcelos [23] define saliency as maximizing classification accuracy. They
utilize the KL distance to measure mutual information between features at a
scene location and class labels. The higher the mutual information between a
region and the class of interest, the higher the saliency of that region. Seo
and Milanfar [11] use local regression kernels to build a “self-resemblance”
map, which measures the similarity of a feature matrix at a pixel of interest
to its neighboring feature matrices.
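As an illustration of the surprise measure of [10], the following minimal sketch performs a Bayes update over a small discrete set of hypotheses and reports the KL distance between posterior and prior. The discrete setting, the function names (`kl_divergence`, `surprise`), and the example numbers are assumptions made for illustration; they are not taken from [10].

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions (q > 0 wherever p > 0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def surprise(prior, likelihood):
    """Bayes update over discrete hypotheses, then KL(posterior || prior)."""
    prior = np.asarray(prior, dtype=float)
    posterior = prior * np.asarray(likelihood, dtype=float)
    posterior = posterior / posterior.sum()
    return kl_divergence(posterior, prior), posterior

# An observation that is equally likely under both hypotheses is not surprising.
s_flat, _ = surprise([0.5, 0.5], [1.0, 1.0])
# An observation that favors one hypothesis shifts beliefs and is surprising.
s_informative, post = surprise([0.5, 0.5], [0.9, 0.1])
```

In this toy case the surprise is zero when the data leave the beliefs unchanged and grows as the posterior moves away from the prior.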

Bruce and Tsotsos [14] proposed the Attention based on Information
Maximization (AIM) model by employing the first principles of information
theory. They model bottom-up saliency as the maximum information sampled from
an image. More specifically, saliency is computed as Shannon's
self-information $-\log p(f)$, where $f$ is a local visual feature. Hou and
Zhang [9] introduced the Incremental Coding Length (ICL) approach to measure
the respective entropy gain of each feature. The goal is to maximize the
entropy of the sampled visual features.
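The self-information measure can be illustrated with a short sketch. This is not the AIM implementation: it assumes a single scalar feature per location (raw intensity here) and a plain histogram density estimate, and the names `self_information` and `n_bins` are hypothetical.

```python
import numpy as np

def self_information(feature_map, n_bins=32):
    """Return -log p(f) at every location, with p estimated by a histogram."""
    f = np.asarray(feature_map, dtype=float)
    counts, edges = np.histogram(f, bins=n_bins)
    prob = counts / counts.sum()
    # Interior edges map each value to a bin index in 0 .. n_bins - 1.
    idx = np.digitize(f, edges[1:-1])
    return -np.log(prob[idx])

# A flat image with one odd pixel: the odd pixel carries the most information.
img = np.zeros((8, 8))
img[3, 4] = 1.0
sal = self_information(img)
```

Rare feature values receive low probability and therefore high saliency, which is the same principle the global operator of Sec. 2.2 applies to sparse coefficients.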

Some models measure saliency in the frequency domain. Hou and Zhang [8]
propose a method based on relating extracted spectral residual features of an
image in the spectral domain to the spatial domain. Guo et al. [24] show that
incorporating the Phase spectrum of the Quaternion Fourier Transform (PQFT)
instead of the amplitude spectrum leads to better saliency predictions in the
spatio-temporal domain.

Some models learn saliency. Kienzle et al. [2] utilize support vector machines
(SVM) to learn the saliency of each image patch directly from human eye
tracking data. Similarly, Judd et al. [3] train a linear SVM from human eye
movement data, using a set of low-, mid-, and high-level image features to
define salient locations. Feature vectors from highly fixated locations are
assigned class label +1 while less fixated locations are assigned label -1.
Zhao and Koch [22] used least-squares regression to learn the weights
associated with a set of feature maps from subjects freely fixating natural
scenes drawn from four different eye-tracking data sets. They find that the
weights can be quite different for different data sets, but their
face-detection and orientation channels are usually more important than color
and intensity channels. Navalpakkam and Itti [29] define visual saliency in
terms of the signal-to-noise ratio (SNR) of a target object versus background
and learn the parameters of a linear combination of low-level features that
yield the highest expected SNR for detecting a target among distractors.

Contributions. The models reviewed above fall into two general categories:
1) models that calculate saliency by implementing local center-surround
operations (e.g., Itti et al. [4], Surprise [10], Judd et al. [3], GBVS [7],
and Rahtu et al. [39]), and 2) models that find salient regions globally by
calculating the rarity of features over the entire scene (e.g., AIM [14],
SUN [12], Torralba [26], SRM [8], ICL [9], and the Rarity model [25]). Our
first contribution is to propose a unified model that benefits from the
advantages of both approaches, which thus far have been treated independently.
Note that the ideas of local and global context have been (separately)
considered in the past [44] [17] by salient object detection/segmentation
approaches, but those have not yet been tested on human fixation prediction,
which is the goal of most models (including ours).

Almost all saliency approaches utilize a color channel. Some have used RGB
(e.g., [4][3][14][12]) while others have employed Lab (e.g., [42][18][39]),
inspired by the finding that it better approximates human color perception.
In particular, Lab aspires to perceptual uniformity, and its L component
closely matches human perception of lightness, while the a and b channels
approximate the human chromatic opponent system. RGB, on the other hand, is
often the default choice for scene representation and storage. We argue that
employing just one color system does not always lead to successful outlier
detection. In Fig. 1, we show that interesting objects in some images are more
salient in Lab color space, while, for some others, saliency detection works
better in RGB. Hence, a yet unexplored strategy, which is our second
contribution, is combining saliency maps from both color spaces.

Figure 2. Diagram of our proposed model. First, the input image is
transformed into Lab and RGB formats. Then, in each channel of a color space,
a global saliency map based on the rarity of an image patch in the entire
scene, and a local saliency map, the dissimilarity between a patch and its
surrounding window, are computed, normalized, and combined. Outputs of the
color channels (i.e., L, a, and b, and similarly R, G, and B) are normalized
and combined once more to form the output of a color system. The final map is
the summation of the normalized maps in the two color spaces.

We compare the accuracy of our model and its subcomponents with mainstream
models over four benchmark eye tracking datasets. These are top-ranked models
that previous studies have shown to be significantly predictive of eye
fixations in free viewing of natural scenes.

2.  Proposed  Saliency  Model 

Our proposed framework is presented in Fig. 2. An input image in two formats
(Lab and RGB) undergoes the same saliency detection, and the resultant maps in
each color system are normalized and summed. In each color format, two
saliency operations, local and global, are applied to each color sub-channel
separately. While the first operation detects outliers in a local surrounding,
the second calculates the rarity of a feature or a region over the entire
scene. Then, local and global rarities are combined to generate the output of
each channel. Channel output maps are then normalized and summed once more to
generate the saliency map. The whole process can be performed over several
scales. There is no need to directly calculate an orientation channel in our
model (since some patches from the chosen ensemble will emulate it; see
below).

There is a large body of behavioral support for both local and global
operations from the cognitive science literature. While early studies favored
the thesis that local contrast attracts attention [15] [19] [4], recent work
has shifted toward understanding top-down conceptual factors which seem to
operate at the object level (see Figs. 1 and 7 for some examples). Such
factors in free-viewing include human bodies, signs, cars, faces, and text [3]
[45]. In particular, Einhäuser et al. [30] showed that objects predict human
fixations better than low-level saliency. It has also been shown that
interesting objects are more salient within a scene, providing support in
favor of object-based attention [43]. As rare features are more likely to
belong to a single object, and since objects are rare compared to background
in natural scenes, we believe that global saliency can help detect top-down
object-level concepts. Thus, instead of leaning on only one component (local
or global), an effective strategy is to integrate both of these complementary
processes.

Figure 3. A dictionary of 200 basis functions learned from a large repository
of natural images for the L channel of the Lab color space. Image size and
patch size ($w$) were 512 × 512 and 8 × 8, respectively.

We estimate saliency on a patch-by-patch basis: each image patch is projected
into the space of a dictionary of image patches (basis functions) learned from
a repository of natural scenes. Each patch of an image is then represented by
a vector of basis coefficients that can linearly reconstruct it.

2.1.  Image  representation 

It is well known that natural images can be sparsely represented by a set of
localized and oriented filters [27] [28]. Also, recent progress in computer
vision has demonstrated that sparse coding is an effective tool for image
representation in several applications such as image classification [31]
[32], face recognition [33], image denoising [34], as well as saliency
detection [14] [12] [9]. The underlying idea behind sparse coding is that a
vision system should be adapted to the statistics of the visual environment in
which it is supposed to operate. As supporting evidence for this theory, it
has been shown that the receptive fields (RFs) of some neurons in V1 cortex
resemble the RFs that are learned by sparse coding algorithms [27].

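Mathematically, given a set of $n$ $m$-dimensional basis signals (dictionary)
$D = [d_1, d_2, \ldots, d_n] \in \mathbb{R}^{m \times n}$, the sparse coding
of an input signal $x \in \mathbb{R}^{m}$ can be found by solving an
$\ell_1$-norm minimization problem:

$$\alpha^{*}(x, D) = \arg\min_{\alpha \in \mathbb{R}^{n}} \frac{1}{2} \|x - D\alpha\|_2^2 + \lambda_1 \|\alpha\|_1 \qquad (1)$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm and $\lambda_1$ is a
regularization parameter. Thus, $x \approx \hat{x} = D\alpha^{*}$, where
$\hat{x}$ is the estimation of $x$. To learn the dictionary $D$, considering a
training set of $q$ data samples
$Y = [y_1, y_2, \ldots, y_q] \in \mathbb{R}^{m \times q}$, an empirical cost
function $g_q(D) = \frac{1}{q} \sum_{i=1}^{q} l_u(y_i, D)$ is minimized, where
$l_u(y_i, D)$ is:

$$l_u(y_i, D) = \min_{\alpha \in \mathbb{R}^{n}} \frac{1}{2} \|y_i - D\alpha\|_2^2 + \lambda_1 \|\alpha\|_1 \qquad (2)$$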

We represent an image patch by a linear combination of basis functions, which
correspond to or act as feature detectors in early visual areas of the brain
(neuron receptive fields or transfer functions). A given input image is first
resized to a square whose side length is a power of two, with the patch size
$w$ selected so that the side length is divisible by $w$. Let
$P = \{p_1, p_2, \ldots, p_n\}$ represent the set of linearized image patches,
taken from top-left to bottom-right with no overlap. Then, using Eq. 1, the
coefficients that reconstruct each patch are calculated and used to represent
that patch. By reshaping the reconstructed patches and aligning them, the
original image can be reproduced.
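The patch decomposition described above can be sketched as follows. This is a minimal NumPy illustration, assuming a single color sub-channel whose height and width are both divisible by the patch size; the function names `image_to_patches` and `patches_to_image` are hypothetical.

```python
import numpy as np

def image_to_patches(channel, w):
    """Split an (H, W) channel into non-overlapping w x w patches, one row per patch."""
    h, wd = channel.shape
    if h % w or wd % w:
        raise ValueError("image size must be divisible by the patch size")
    blocks = channel.reshape(h // w, w, wd // w, w).swapaxes(1, 2)
    return blocks.reshape(-1, w * w)      # ordered top-left to bottom-right

def patches_to_image(patches, shape, w):
    """Inverse of image_to_patches: realign linearized patches into an image."""
    h, wd = shape
    blocks = patches.reshape(h // w, wd // w, w, w).swapaxes(1, 2)
    return blocks.reshape(h, wd)

channel = np.arange(16 * 16, dtype=float).reshape(16, 16)
P = image_to_patches(channel, 8)          # 4 patches, each a 64-D vector
restored = patches_to_image(P, channel.shape, 8)
```

Each row of `P` plays the role of one linearized patch; in the model, it is the sparse code of each row, rather than the raw pixels, that is used downstream.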

To learn a dictionary of patch bases (i.e., minimizing $g_q(D)$), we extracted
500,000 8 × 8 image patches (for each sub-channel of RGB or Lab) from 1500
randomly selected color images of natural scenes. Each basis function in the
dictionary is an 8 × 8 = 64-dimensional vector. A sample learned dictionary of
size 200 is shown in Fig. 3. We experimented with different dictionary sizes
(10, 50, 100, 200, 400, and 1000) and found that fixation prediction results
did not change much. The sparse codes $\alpha_i$ are computed with the above
basis using the LARS algorithm [5] implemented in the SPAMS toolbox^.
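For readers who want to experiment without SPAMS, the $\ell_1$-regularized problem of Eq. 1 can be approximated with a few lines of NumPy. The sketch below deliberately swaps in ISTA (iterative soft-thresholding) for LARS; the random dictionary, the value of `lam`, and the iteration count are assumptions for illustration, not the settings used in the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam, n_iter=500):
    """Approximately minimize 0.5 * ||x - D a||_2^2 + lam * ||a||_1 over a."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)             # gradient of the quadratic term
        a = soft_threshold(a - step * grad, step * lam)
    return a

# Hypothetical data: a random unit-norm dictionary (64-D atoms, 200 of them)
# and a signal synthesized from three atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 200))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(200)
a_true[[3, 50, 120]] = [1.5, -2.0, 1.0]
x = D @ a_true
a_hat = sparse_code_ista(x, D, lam=0.05)
x_hat = D @ a_hat                            # reconstruction of x from its code
```

ISTA is simple but converges more slowly than LARS-type solvers, so it is best regarded as a reference implementation for small experiments.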

2.2.  Measuring  visual  saliency 

Our model is based on two saliency operations. The first one, local contrast,
considers the rarity of image regions with respect to (small) local
neighborhoods (guided by the well-established computational architecture of
Koch and Ullman [19] and Itti et al. [4]). The second operation, global
contrast, evaluates the saliency of an image patch using its contrast with
respect to the patch statistics over the entire image. Finally, local and
global contrast maps are consolidated. We repeat the process for each channel
of both RGB and Lab color systems and fuse the saliency maps of the
sub-channels of a color space to generate a saliency map for each color
system. At each stage, maps are normalized before integration (see Fig. 2).

^http://www.di.ens.fr/willow/SPAMS/index.html

Figure 4. Illustration of global and local saliency for an image patch.
Global saliency measures the rarity of a patch in the entire scene, while
local saliency measures the difference between a patch and its surrounding
context.

Local saliency. Local saliency ($S_l$) in our model is the average weighted
dissimilarity between a center patch $i$ (blue rectangle in Fig. 4) and the
$L$ patches in a rectangular neighborhood around it (red rectangle in Fig. 4):

$$S_l^c(p_i) = \frac{1}{L} \sum_{j=1}^{L} W_{ij}^{-1} D_{ij}^c \qquad (3)$$

where $W_{ij}$ is the Euclidean distance between the center patch $i$ and the
surround patch $j$. Thus, patches further away from the center patch have
less influence on the saliency of the center patch. $D_{ij}$ denotes the
Euclidean distance in feature space between patch $i$ and patch $j$, i.e.,
between $\alpha_i$ and $\alpha_j$, the vectors of coefficients for patches $i$
and $j$, respectively, derived from sparse coding (Sec. 2.1). While here we
use the Euclidean distance ($\ell_2$ distance), the KL distance [23][38], the
$\ell_1$ distance [17], and the correlation coefficient have also been used in
the past to calculate patch similarity. Superscript $c$ denotes the color
sub-channel (L, a, or b in Lab, or R, G, or B in RGB).
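The local operator described above can be sketched as follows. The sketch assumes that patches lie on a regular grid, that the neighborhood is a square window of a given radius clipped at the image border, and that the weight of a neighbor is the inverse of its grid distance to the center; the function name `local_saliency` and the toy data are hypothetical.

```python
import numpy as np

def local_saliency(codes, grid_shape, radius=1):
    """codes: (n_patches, n_coeffs) sparse codes in row-major patch order."""
    rows, cols = grid_shape
    A = codes.reshape(rows, cols, -1)
    S = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if (dr == 0 and dc == 0) or not (0 <= rr < rows and 0 <= cc < cols):
                        continue
                    w_ij = np.hypot(dr, dc)                     # spatial distance W_ij
                    d_ij = np.linalg.norm(A[r, c] - A[rr, cc])  # feature distance D_ij
                    total += d_ij / w_ij
                    count += 1
            S[r, c] = total / count if count else 0.0
    return S

# Toy codes on a 5 x 5 patch grid: only the central patch differs from the rest.
codes = np.zeros((25, 3))
codes[12, 0] = 1.0
S_l = local_saliency(codes, (5, 5))
```

With a radius of 1, each interior patch is compared with its 8 immediate neighbors, and diagonal neighbors contribute less than horizontal or vertical ones.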

Global saliency. It often happens that a local patch is similar to its
neighbors but the whole region (i.e., local + surrounding) is still globally
rare in the entire scene. Using only local saliency may suppress areas within
a homogeneous region, resulting in blank holes, which sometimes impedes
object-based attention (e.g., a uniformly textured object would only be
salient at its borders). To remedy this shortcoming, we build our global
saliency operator guided by the information-theoretic saliency measure of
Bruce and Tsotsos [14]. Instead of each pixel, here we calculate the
probability of each patch $P(p_i)$ over the entire scene and use its inverse
as the global saliency:

$$S_g^c(p_i) = P(p_i)^{-1} = \Big( \prod_{j=1}^{n} P(\alpha_{ij}) \Big)^{-1}$$

$$\log(S_g^c(p_i)) = -\log(P(p_i)) = -\sum_{j=1}^{n} \log(P(\alpha_{ij})) \qquad (4)$$

To calculate $P(p_i)$, we assume that the coefficients $\alpha$ are
conditionally independent of each other. This is to some extent guaranteed by
the sparse coding algorithm [5]. For each coefficient of the patch
representation vector (i.e., $\alpha_{ij}$), first a binned histogram (100
bins here) is calculated from all of the patches in the scene and is then
converted to a pdf ($P(\alpha_{ij})$) by dividing by its sum. If a patch is
rare in one of the features, the above product will get a small value, leading
to high global saliency for that patch overall. Fig. 4 illustrates the process
of calculating global saliency.
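The histogram-based estimate of global saliency can be sketched as below. The sketch works in the log domain to avoid underflow in the product over coefficients; the function name `global_saliency_log` and the toy codes are hypothetical, and the 100-bin default follows the text.

```python
import numpy as np

def global_saliency_log(codes, n_bins=100):
    """Return -sum_j log P(alpha_ij) for every patch i (log of global saliency)."""
    n_patches, n_coeffs = codes.shape
    log_s = np.zeros(n_patches)
    for j in range(n_coeffs):
        col = codes[:, j]
        counts, edges = np.histogram(col, bins=n_bins)
        pdf = counts / counts.sum()             # histogram divided by its sum
        idx = np.digitize(col, edges[1:-1])     # bin index of each patch's coefficient
        log_s -= np.log(pdf[idx])
    return log_s

# Toy scene of 50 patches with 3 coefficients each; patch 7 has one rare coefficient.
codes = np.zeros((50, 3))
codes[7, 1] = 2.0
log_sg = global_saliency_log(codes)
```

Because every observed coefficient falls in a non-empty bin of its own histogram, the logarithm is always finite; a patch that is rare in even one coefficient receives a large value.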

Combined saliency. Local and global saliency maps are then normalized and
combined:

$$S_{lg}^c(p_i) = \mathcal{N}(S_l^c(p_i)) \circ \mathcal{N}(S_g^c(p_i)) \qquad (5)$$

where $\circ$ is an integration scheme (i.e., one of $+$, $*$, max, or min).
Through our experiments, we found that max at this stage leads to slightly
higher accuracy than the others. Then, the saliency values of a patch in all
channels are normalized and summed again to generate the saliency of the patch
in each color system. For the Lab color system, we have:

$$S_{lg}^{Lab}(p_i) = \sum_{c \in \{L, a, b\}} \mathcal{N}(S_{lg}^c(p_i)) \qquad (6)$$

The same operation applies to the RGB color space. The final saliency of a
patch is then the summation of the normalized saliency maps in both color
systems:

$$S_{lg}(p_i) = \mathcal{N}(S_{lg}^{Lab}(p_i)) + \mathcal{N}(S_{lg}^{RGB}(p_i)) \qquad (7)$$

Normalization ($\mathcal{N}$). Similar to [4], first the average of all local
maxima (defined as points greater than their 4 neighboring points) with
intensity above a threshold is calculated ($M_l$). Then the map is multiplied
by $(M_g - M_l)^2$, where $M_g$ is the global maximum in the map (this
operation is known as maxnorm).
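The normalization and fusion steps can be sketched as follows. Several details are assumptions rather than the authors' settings: maps are first rescaled to [0, 1], the threshold is 0.1, and the average is taken over local maxima other than the global one (as in the original formulation of [4]) so that a map with a single peak is not zeroed out. The names `maxnorm`, `fuse_color_system`, and `final_map` are hypothetical, and max is used as the integration scheme of Eq. 5.

```python
import numpy as np

def maxnorm(m, threshold=0.1):
    """Rescale to [0, 1], then multiply by (M_g - M_l)^2."""
    m = np.asarray(m, dtype=float)
    m = m - m.min()
    if m.max() > 0:
        m = m / m.max()
    p = np.pad(m, 1, mode="constant", constant_values=-np.inf)
    c = p[1:-1, 1:-1]
    # Local maxima: strictly greater than the 4 neighbors and above the threshold.
    is_peak = ((c > p[:-2, 1:-1]) & (c > p[2:, 1:-1]) &
               (c > p[1:-1, :-2]) & (c > p[1:-1, 2:]) & (c > threshold))
    m_g = m.max()
    others = is_peak & (m < m_g)           # assumption: exclude the global maximum
    m_l = m[others].mean() if others.any() else 0.0
    return m * (m_g - m_l) ** 2

def fuse_color_system(local_maps, global_maps):
    """Eq. 5 with max per channel, then the normalized sum of Eq. 6."""
    per_channel = [np.maximum(maxnorm(l), maxnorm(g))
                   for l, g in zip(local_maps, global_maps)]
    return sum(maxnorm(ch) for ch in per_channel)

def final_map(lab_map, rgb_map):
    """Eq. 7: sum of the normalized maps of the two color systems."""
    return maxnorm(lab_map) + maxnorm(rgb_map)

one_peak = np.zeros((7, 7)); one_peak[3, 3] = 5.0
two_peaks = np.zeros((7, 7)); two_peaks[1, 1] = 1.0; two_peaks[5, 5] = 0.9
fused = fuse_color_system([one_peak] * 3, [two_peaks] * 3)
```

The operator promotes maps with one dominant peak and suppresses maps with several peaks of comparable strength, which is why the second toy map is scaled down by roughly a factor of 100.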

Extension to the scale space. Since objects appear at different sizes and
depths, it is necessary to perform saliency detection at several spatial
scales. To make our approach multi-scale, we calculate the saliency of
downsampled images (divisions by 2) from the original image and then take the
average after normalization:

$$S(x) = \frac{1}{M} \sum_{i=1}^{M} \mathcal{N}(S_{lg}^{i}(x)) \qquad (8)$$

where $M$ is the number of scales and $S_{lg}^{i}(x)$ is the saliency of pixel
$x$ derived from the saliency map created by Eq. 7 at scale $i$. Finally, we
smooth the resultant map by convolving it with a small Gaussian kernel for
better visualization.
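The multi-scale averaging can be sketched as below. The sketch assumes block-average downsampling, nearest-neighbor upsampling back to the original size, and a saliency function that returns a map of the same size as its input; the stand-in saliency and normalization functions in the example are placeholders, not the operators of the model.

```python
import numpy as np

def downsample2(img):
    """Halve the resolution by averaging 2 x 2 blocks (sides must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_to(m, shape):
    """Nearest-neighbor upsampling by an integer factor."""
    fy, fx = shape[0] // m.shape[0], shape[1] // m.shape[1]
    return np.kron(m, np.ones((fy, fx)))

def multiscale_saliency(img, saliency_fn, normalize_fn, n_scales=3):
    """Average of the normalized saliency maps computed at successive scales."""
    acc = np.zeros(img.shape)
    scaled = img
    for _ in range(n_scales):
        acc += upsample_to(normalize_fn(saliency_fn(scaled)), img.shape)
        scaled = downsample2(scaled)
    return acc / n_scales

def toy_saliency(x):            # placeholder: deviation from the mean intensity
    return np.abs(x - x.mean())

def toy_normalize(m):           # placeholder: rescale so the maximum is 1
    return m / m.max() if m.max() > 0 else m

img = np.zeros((16, 16))
img[4:8, 4:8] = 1.0
sal = multiscale_saliency(img, toy_saliency, toy_normalize, n_scales=3)
```

With a 16 × 16 input and three scales, the maps are computed at 16, 8, and 4 pixels per side, mirroring the 256, 128, and 64 pixel scales used in the experiments.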

Handling center-bias. A tedious and challenging factor in saliency modeling is
handling center-bias in eye tracking data, which is the tendency of human
subjects to preferentially look near the image center [13]. This generates a
high central peak in the overall 2D histogram of fixations, resulting in high
scores for a trivial saliency model whose map is just a Gaussian blob at the
image center. To account for center-bias, some models intrinsically (e.g.,
GBVS [7], E-Saliency [37]) or extrinsically (e.g., Judd et al. [3] and Yang et
al. [16]) add a center prior to their algorithms^. Here, instead of adding
center bias to our model, we use a scoring metric that discounts center-bias
in a non-parametric manner (see next section) when evaluating our and other
saliency models against eye-tracking data.

^Some models add center-bias either by fitting a 2D Gaussian to fixation data
or simply by using the average fixation map (i.e., 2D histogram).


3.  Experimental  Setup 

To validate our proposed method, we carried out several experiments on four
benchmark datasets using the “shuffled AUC” score described below^. The main
reason for employing several datasets is that current datasets have different
image and feature statistics, stimulus variety, biases (e.g., center-bias),
and eye tracking parameters. Hence, it is necessary to employ several
datasets, as models leverage different features whose distributions vary
across datasets.

Evaluation metric. The most widely used score for saliency model evaluation is
the AUC [14]. In AUC, human fixations for an image are considered as the
positive set and some points from the image are randomly chosen (uniformly) as
the negative set. The saliency map is then treated as a binary classifier to
separate the positive samples from the negatives. By thresholding the saliency
map and plotting true positive rate vs. false positive rate, an ROC curve is
obtained and the area under it is calculated. A problem with AUC is that it
generates a large value for a central Gaussian model and is thus affected by
center-bias [13]. To tackle center-bias, Zhang et al. [12] introduced the
shuffled AUC score, with the only difference being that, instead of selecting
negative points randomly from a uniform distribution, all human fixations
(except the positive set) are used as the negative set. The shuffled AUC score
generates a value of 0.5 for both a central Gaussian and a completely uniform
map. Note that in addition to shuffled AUC, some other scores have often been
used in the past, for example Normalized Scanpath Saliency (NSS) [35], KL
distance [10], and Correlation Coefficient [46]. Here we avoid using them, as
they are all affected by center-bias. Instead, we adopt the shuffled AUC
score, which is becoming a standard for saliency model evaluation [40] [12].
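A minimal sketch of the shuffled AUC computation is given below. It assumes that fixations are provided as integer (row, column) pixel positions, that the negative set consists of fixations recorded on other images, and that the area under the ROC curve is computed through the equivalent rank statistic with ties counted as one half. The function names are hypothetical, and this is not the released evaluation software.

```python
import numpy as np

def auc_from_scores(pos, neg):
    """Area under the ROC curve via pairwise comparisons; ties count as 0.5."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))

def shuffled_auc(sal_map, fixations, other_fixations):
    """Positives: fixations on this image. Negatives: fixations from other images."""
    fix = np.asarray(fixations)
    oth = np.asarray(other_fixations)
    pos = sal_map[fix[:, 0], fix[:, 1]]
    neg = sal_map[oth[:, 0], oth[:, 1]]
    return auc_from_scores(pos, neg)

fix = [[2, 3], [5, 5]]                 # hypothetical fixations on the test image
oth = [[0, 0], [9, 9], [4, 4]]         # hypothetical fixations from other images

uniform = np.ones((10, 10))
peaked = np.zeros((10, 10))
peaked[2, 3] = peaked[5, 5] = 1.0

score_uniform = shuffled_auc(uniform, fix, oth)
score_peaked = shuffled_auc(peaked, fix, oth)
```

Because the negatives are themselves human fixations, they share the central tendency of the positives, so a map that only encodes center-bias gains no advantage.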

The fixation datasets used are briefly described below.

TORONTO^ [14]. This is the most widely used dataset for model comparison. It
contains 120 color images with a resolution of 511 × 681 pixels from indoor
and outdoor environments. Images are presented in random order to 20 subjects
for 3 seconds each, with 2 seconds of gray mask in between.

MIT^ [3]. This is the largest dataset, containing 1003 images (resolution
from 405 × 1024 to 1024 × 1024 pixels) collected from the Flickr and LabelMe
datasets. There are 779 landscape and 228 portrait images. Fifteen subjects
freely viewed the images for 3 sec. each, with a 1 sec. delay in between.

KOOTSTRA^ [20]. This dataset contains 101 images from 5 different categories:
12 animals, 12 automan, 16 buildings, 20 flowers, and 41 natural scenes.
Images are observed by 31 subjects in the age range of 17 to 32 for 5 seconds.
Image resolution is 768 × 1024 pixels. This dataset is especially challenging
because there are no explicit objects or salient regions within many of the
images.

^Our software for score calculation and saliency maps over the 4 datasets are
available at: https://sites.google.com/site/saliencyevaluation/.

^Available at: http://www-sop.inria.fr/members/Neil.Bruce

^This dataset is available at: http://people.csail.mit.edu/tjudd/

^This dataset is available at: http://www.csc.kth.se/~kootstra/

Figure 5. Model comparison. Fixation prediction accuracy of our saliency
operations (Local, Global, LG (Local + Global)) along with 10 state-of-the-art
models over 4 benchmark datasets. The x-axis indicates the σ of the Gaussian
kernel (in image widths) by which maps are smoothed. The NUSEF dataset
contains some copyrighted images that are not easily accessible and that we do
not use; only 412 images are used here.

| Dataset | AIM [14] | GBVS [7] | SRM [8] | ICL [9] | Itti [4] | Judd [3] | PQFT [24] | SDSR [11] | SUN [12] | Surprise [10] | Local $S_l$ | Global $S_g$ | LG $S_{lg}$ | Gauss | IO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TORONTO [14] | 0.67 | 0.647 | 0.685 | 0.691 | 0.61 | 0.68 | 0.657 | 0.687 | 0.66 | 0.605 | 0.691 | 0.69 | 0.696 | 0.50 | 0.73 |
| Optimal σ | 0.01 | 0.02 | 0.05 | 0.01 | 0.07 | 0.03 | 0.04 | 0.05 | 0.03 | 0.06 | 0.04 | 0.03 | 0.03 | - | - |
| MIT [3] | 0.664 | 0.637 | 0.65 | 0.666 | 0.61 | 0.658 | 0.65 | 0.646 | 0.649 | 0.62 | 0.653 | 0.676 | 0.678 | 0.50 | 0.75 |
| Optimal σ | 0.02 | 0.02 | 0.05 | 0.03 | 0.06 | 0.02 | 0.04 | 0.05 | 0.04 | 0.05 | 0.04 | 0.04 | 0.03 | - | - |
| KOOTSTRA [20] | 0.575 | 0.563 | 0.576 | 0.589 | 0.57 | 0.587 | 0.57 | 0.59 | 0.55 | 0.566 | 0.591 | 0.578 | 0.593 | 0.50 | 0.62 |
| Optimal σ | 0.01 | 0.01 | 0.04 | 0.01 | 0.07 | 0.02 | 0.03 | 0.03 | 0.02 | 0.07 | 0.03 | 0.02 | 0.03 | - | - |
| NUSEF [6] | 0.623 | 0.595 | 0.62 | 0.614 | 0.56 | 0.61 | 0.60 | 0.60 | 0.60 | 0.58 | 0.583 | 0.627 | 0.632 | 0.49 | 0.66 |
| Optimal σ | 0.04 | 0.01 | 0.06 | 0.03 | 0.09 | 0.03 | 0.05 | 0.04 | 0.04 | 0.06 | 0.05 | 0.04 | 0.05 | - | - |

Table 1. Maximum performance of the models shown in Fig. 5. The second row
for each dataset gives the σ value at which each model reaches its maximum
performance. Parameter settings: surround window size = 1; number of scales =
1 (256 × 256). Accuracies of the three best models over each dataset are shown
in bold face font. LG is the Local+Global model and IO stands for the human
inter-observer model.

NUSEF^ [6]. This dataset includes 758 images containing emotionally affective
scenes/objects such as expressive faces, nudes, unpleasant concepts, and
interactive actions. In total, 75 subjects free-viewed part of the image set
for 5 seconds each (on average 25 subjects per image).

4.  Performance  Evaluation 

Here, along with the evaluation of our model, we also compare 10
state-of-the-art bottom-up saliency models. Software for these models is
publicly available^. Additionally, we implemented two simple models to serve
as baselines: Gaussian Blob (Gauss) and Human inter-observer (IO). The
Gaussian blob is simply a 2D Gaussian shape drawn at the center of the image;
it is expected to predict human gaze well if such gaze is strongly clustered
around the center [13]. The human model outputs, for a given stimulus, a map
built by integrating fixations from subjects other than the one under test
while they watched that stimulus. The human map is usually smoothed by
convolving it with a small Gaussian kernel. This model provides an upper bound
on the prediction accuracy of saliency models, to the degree that different
humans may be the best predictors of each other. Model maps were resized to
the size of the original image, onto which eye data have been recorded.

^Available at: http://mmas.comp.nus.edu.sg/NUSEF.html

^AIM: http://www-sop.inria.fr/members/Neil.Bruce/
GBVS: http://www.klab.caltech.edu/~harel/
SRM & ICL: http://www.klab.caltech.edu/~xhou/
Itti & Surprise: http://ilab.usc.edu/toolkit/
Judd: http://people.csail.mit.edu/tjudd/
PQFT: http://visual-attention-processing.googlecode.com/
SDSR: http://alumni.soe.ucsc.edu/~rokaf/
SUN: http://cseweb.ucsd.edu/~16zhang/

An important parameter in model comparison is the smoothness (blurring) of
saliency maps [40]. Here, we smoothed the saliency map of each model by
convolving it with a variable-size Gaussian kernel. Fig. 5 presents the
shuffled AUC score of the models over a range of standard deviations σ of the
Gaussian kernel, expressed in image widths (from 0.01 to 0.13 in steps of
0.01). The maximum score over this range for each model is shown in Table 1.
Smoothing the saliency map dramatically affects the accuracy of some models
(e.g., Itti, Surprise, PQFT, and SUN). Our combined saliency model (local +
global and RGB + Lab) outperforms the other models over all 4 datasets, with a
larger margin over the MIT and NUSEF datasets. Our local and global saliency
operators are less accurate than the combined model but still score above
several models. Results show that the global saliency operator works better
than the local one over the large datasets (MIT and NUSEF), while they are
close to each other over the TORONTO dataset. Models were more successful over
the TORONTO and MIT datasets and less so over KOOTSTRA and NUSEF, possibly
because of the higher complexity of the stimuli in the latter datasets. The
NUSEF dataset contains many affective and emotional stimuli, while the
KOOTSTRA dataset contains images without well-defined interesting and salient
objects (e.g., nature scenes, trees, and flowers).
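The smoothing sweep can be sketched as follows. The sketch assumes a separable Gaussian blur with edge padding and a kernel truncated at three standard deviations, with σ expressed as a fraction of the map width; the function names are hypothetical and the padding and truncation choices are not taken from the paper.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1D Gaussian truncated at about 3 standard deviations."""
    radius = max(1, int(np.ceil(3 * sigma)))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(sal_map, sigma_frac):
    """Separable Gaussian blur; sigma is sigma_frac times the map width, in pixels."""
    sigma = sigma_frac * sal_map.shape[1]
    k = gaussian_kernel1d(sigma)
    pad = len(k) // 2
    padded = np.pad(sal_map, pad, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

sigmas = np.arange(1, 14) / 100.0      # 0.01, 0.02, ..., 0.13 of the image width

impulse = np.zeros((20, 30))
impulse[10, 15] = 1.0
blurred = smooth(impulse, 0.05)        # sigma = 1.5 pixels for a 30-pixel-wide map
```

In an evaluation loop, each model's map would be smoothed with every value in `sigmas` and scored, and the best score kept, which corresponds to the "Optimal σ" rows of Table 1.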



| Dataset | RGB $S_l$ | RGB $S_g$ | RGB $S_{lg}$ | Lab $S_l$ | Lab $S_g$ | Lab $S_{lg}$ | RGB + Lab $S_l$ | RGB + Lab $S_g$ | RGB + Lab $S_{lg}$ |
|---|---|---|---|---|---|---|---|---|---|
| TORONTO | 0.646 | 0.647 | 0.653 | 0.670 | 0.660 | 0.660 | 0.678 | 0.668 | 0.683 |
| MIT | 0.627 | 0.639 | 0.640 | 0.646 | 0.644 | 0.651 | 0.658 | 0.663 | 0.667 |
| KOOTSTRA | 0.574 | 0.572 | 0.578 | 0.572 | 0.555 | 0.570 | 0.589 | 0.573 | 0.591 |
| NUSEF | 0.599 | 0.610 | 0.610 | 0.556 | 0.596 | 0.592 | 0.569 | 0.614 | 0.616 |

Table 2. RGB vs. Lab for saliency detection. $S_l$: Local; $S_g$: Global;
$S_{lg}$: Local + Global. Parameter settings: scales (M) = 1 (256 × 256);
window size = 1. Results are over original saliency maps without smoothing.


Among the compared models, ICL [9], AIM [14], SDSR [11], and Judd et al. [3]
performed better than the rest. The Itti et al. [4] and Surprise [10] models
are ranked at the bottom. As we expected, a trivial Gaussian blob located at
the image center scores around 0.5 over all datasets, and the human
inter-observer model scores the best, providing a gold standard for visual
saliency models. Humans are less correlated over the KOOTSTRA and NUSEF
datasets.

Lab vs. RGB for saliency detection. To assess the power of the Lab and RGB
color spaces for saliency detection, we performed an experiment using each of
the two color systems. Results over all four datasets are shown in Table 2.
According to our results, it is not possible to tell which color system is
the best. The Lab color space leads to higher accuracies over the TORONTO and
MIT datasets, while RGB works better over the KOOTSTRA and NUSEF datasets.
Integrating both color systems leads to higher accuracy than each one taken
separately, consistently over all four datasets. This indicates the
importance of saliency integration over both color systems. Also note that
for each component (local or global), the combination of RGB and Lab leads to
higher performance than either color system alone.

Influence of surrounding window size and number of scales. Here we analyze how
the size of the surrounding window (and hence the number of neighbors) and the
number of spatial scales affect the performance of our models. As the left
diagram of Fig. 6 shows, increasing the number of neighbors reduces the accuracy
of the local saliency operator. Correspondingly, this reduces the accuracy of
the combined model. Note that the global operator is not affected by this
parameter. As shown in the right panel of Fig. 6, increasing the number of
scales improves the results up to a certain point (here using 3 scales
[256, 128, and 64]), after which accuracy drops (when using 4 scales
[512, 256, 128, and 64]).

Runtime aspects. It takes approximately 5 seconds for our model to process a
256 x 256 image in both RGB and Lab color spaces using a personal computer
running Linux Ubuntu with 5.8 GB RAM and a 12-core Intel i7 3.2 GHz CPU. Our
model is faster than AIM (16 sec), Judd (without object detectors) (4.7 sec),
close to SDSR (2.4) and GBVS (2), and slower than PQFT (1 sec), SUN (1),
Itti (0.28) (using [52]), and ICL (0.1) models. Our global saliency operator is
approximately 3 times faster than our local saliency operator.

For qualitative assessment, we show in Fig. 7 saliency maps of our combined
saliency model and compared models for sample images from the TORONTO and MIT
datasets.

Figure 6. Parameter analysis. Left: Effect of the surround window size on
accuracy over the TORONTO dataset using 256 x 256 images (M = 1). Right:
Influence of scale on results over the TORONTO and KOOTSTRA datasets (window
size = 1). The first three bars are 256, 128, 64 and the fourth one represents
four scales 512, 256, 128, 64.

5.  Conclusions  and  Future  Works 

We enhance the state-of-the-art in saliency modeling by proposing an accurate
and easy-to-implement model that utilizes image representations in both RGB and
Lab color spaces. Furthermore, we introduce one local and one global saliency
operator, each representing a class of previous models to some extent. We
conclude that integration of local and global saliency operators works better
than using either one alone, which encourages more research in this direction.
Similarly, combining both color systems strongly benefits saliency detection
and eye fixation prediction.

There are two areas that we would like to improve upon. The first one is
incorporating top-down factors for fixation prediction. The large gap between
models and the human inter-observer model (see Table 1) is mainly due to the
role of top-down concepts (e.g., faces, text [45], people, and cars [3],
affective and emotional stimuli or actions within scenes [6]) when freely
viewing scenes. While some of these factors have been utilized for saliency
detection in the past [3], adding more top-down features (e.g., by reliable
detection of text in natural scenes) can improve the accuracy of current
models. The second area is extending our model to saliency detection in the
spatio-temporal domain (videos).

Supported by the National Science Foundation (grant number BCS-0827764), the
Army Research Office (W911NF-08-1-0360 and W911NF-11-1-0046), and the U.S. Army
(W81XWH-10-2-0076). The authors affirm that the views expressed herein are
solely their own, and do not represent the views of the United States
government or any agency thereof.

References 

[1]  J.  M.  Wolfe  and  T.  S.  Horowitz.  What  attributes  guide  the  deployment  of  visual 
attention  and  how  do  they  do  it?  Nat.  Rev.  Neurosci.,  5:1-7,  2004.  1 

[2] W. Kienzle, F. A. Wichmann, B. Scholkopf, and M. O. Franz. A nonparametric
approach to bottom-up visual saliency. NIPS, 2007. 2

[3] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where
humans look. ICCV, 2009. 2, 3, 5, 6, 7

[4]  L.  Itti,  C.  Koch,  and  E.  Niebur.  A  model  of  saliency-based  visual  attention  for 
rapid  scene  analysis.  IEEE  PAMI,  1998.  1,  2,  3,  4,  5,  6,  7 

[5] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix
factorization and sparse coding. J. of Machine Learning Research, 2010. 4


Figure 7. Visual comparison of our combined saliency model and 10
state-of-the-art models over samples from TORONTO (top) and MIT datasets.
Columns, left to right: Image, Human, Ours, AIM, GBVS, SER, ICL, Itti, Judd,
PQFT, SDSR, SUN, Surprise.


[6]  R.  Subramanian,  H.  Katti,  N.  Sebe,  M.  Kankanhalli,  and  T.S.  Chua.  An  eye 
fixation  database  for  saliency  detection  in  images.  ECCV,  2010.  6,  7 

[7] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. NIPS, 2006. 2, 5, 6

[8]  X.  Hou  and  L.  Zhang.  Saliency  detection:  A  spectral  residual  approach.  CVPR, 
2007.  2,  6 

[9]  X.  Hou  and  L.  Zhang.  Dynamic  visual  attention:  Searching  for  coding  length 
increments.  NIPS,  2008.  2,  3,  6 

[10]  L.  Itti  and  P.  Baldi.  Bayesian  surprise  attracts  human  visual  attention.  NIPS, 
2005.  2,  5,  6,  7 

[11]  H.J.  Seo  and  P.  Milanfar.  Static  and  space-time  visual  saliency  detection  by 
self-resemblance.  Journal  of  Vision,  9,  2009.  2,  6,  7 

[12] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell. SUN: A
Bayesian framework for saliency using natural statistics. J. of Vision,
8(32):1-20, 2008. 2, 3, 5, 6, 7

[13] B.W. Tatler. J. Vision, 14(7):1-17, 2007. 5, 6

[14]  N.D.B.  Bruce  and  J.K.  Tsotsos.  Saliency  based  on  information  maximization. 
NIPS,  2005.  1,  2,  3,  4,  5,  6,  7 

[15] A.M. Treisman and G. Gelade. A feature integration theory of attention.
Cognitive Psych., 12:97-136, 1980. 1, 3

[16] Y. Yang, M. Song, N. Li, J. Bu, and C. Chen. What is the chance of
happening: A new way to predict where people look. ECCV, 2010. 5

[17]  L.  Duan,  C.  Wu,  J.  Miao,  L.  Qing,  and  Y.  Fu.  Visual  saliency  detection  by 
spatially  weighted  dissimilarity.  CVPR  2011.  2,  4 

[18] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau. A coherent
computational approach to model bottom-up visual attention. PAMI, 2006. 2

[19] C. Koch and S. Ullman. Shifts in selective visual attention: Towards the
underlying neural circuitry. Human Neurobiology, 1985. 1, 3, 4

[20]  G.  Kootstra,  A.  Nederveen,  and  B.  de  Boer.  Paying  attention  to  symmetry. 
BMVC,  2008.  2,  5,  6 

[21] R. Valenti, N. Sebe, and T. Gevers. Image saliency by isocentric
curvedness and color. ICCV, 2009. 2

[22]  Q.  Zhao  and  C.  Koch.  Learning  a  saliency  map  using  fixated  locations  in  natural 
scenes.  Journal  of  Vision,  11(3),  2011.  2 

[23]  D.  Gao,  V.  Mahadevan,  and  N.  Vasconcelos.  The  discriminant  center-surround 
hypothesis  for  bottom-up  saliency.  NIPS,  2007.  2,  4 

[24] C. Guo and L. Zhang. A novel multiresolution spatiotemporal saliency
detection model and its applications in image and video compression. IEEE
Trans. on Image Processing, 2010. 2, 6

[25]  M.  Mancas.  Computational  attention:  Modelisation  and  application  to  audio 
and  image  processing.  PhD.  thesis,  2007.  2 

[26]  A.  Torralba,  A.  Oliva,  M.  Castelhano  and  J.M.  Henderson.  Contextual  guidance 
of  attention  in  natural  scenes:  The  role  of  Global  features  on  object  search. 
Psychological  Review,  2006.  2 

[27] B. Olshausen and D. Field. Emergence of simple-cell receptive field
properties by learning a sparse code for natural images. Nature, 1996. 3

[28] E. Simoncelli and B. Olshausen. Natural image statistics and neural
representation. Annual Review of Neuroscience, 24, 2001. 3


[29]  V.  Navalpakkam  and  L.  Itti.  An  integrated  model  of  top-down  and  bottom-up 
attention  for  optimizing  detection  speed.  CVPR,  2006.  2 

[30] W. Einhauser, M. Spain, and P. Perona. Objects predict fixations better
than early saliency. Journal of Vision, 2008. 3

[31] C. Kanan and G. Cottrell. Robust classification of objects, faces, and
flowers using natural image statistics. CVPR, 2010. 3

[32] F. Bach, J. Mairal, J. Ponce, and G. Sapiro. Sparse coding and dictionary
learning for image analysis. CVPR, 2010. 3

[33] A. Yang, A. Ganesh, Z. Zhou, S. Sastry, and Y. Ma. A review of fast
l1-minimization algorithms for robust face recognition. http://arxiv.org,
2010. 3

[34] M. Elad and M. Aharon. Image denoising via sparse and redundant
representations over learned dictionaries. IEEE Transactions on Image
Processing, 15(12):3736-3745, 2006. 3

[35] R. Peters, A. Iyer, L. Itti, and C. Koch. Components of bottom-up gaze
allocation in natural images. Vision Res., 45, 2005. 5

[36] D. Parkhurst, K. Law, and E. Niebur. Modeling the role of salience in the
allocation of overt visual attention. Vision Res., 2002. 2

[37] T. Avraham and M. Lindenbaum. Esaliency (Extended Saliency): Meaningful
attention using stochastic image modeling. PAMI, 2010. 5

[38]  D.A.  Klein  and  S.  Frintrop.  Center-surround  divergence  of  feature  statistics  for 
salient  object  detection.  ICCV,  2011.  4 

[39]  E.  Rahtu,  J.  Kannala,  M.  Salo,  and  J.  Heikkila.  Segmenting  salient  object  from 
images  and  videos.  ECCV,  2010.  2 

[40]  X.  Hou,  J.  Harel,  and  Christof  Koch.  Image  Signature:  Highlighting  sparse 
salient  regions.  IEEE  PAMI,  In  press.  5,  6 

[41]  L.  Itti,  N.  Dhavale,  and  F.  Pighin.  Realistic  avatar  eye  and  head  animation  using 
a  neurobiological  model  of  visual  attention.  SPIE,  2003.  2 

[42] A. Garcia-Diaz, X. R. Fdez-Vidal, X. M. Pardo, and R. Dosil. Decorrelation
and distinctiveness provide with human-like saliency. ACIVS, 5807, 2009. 2

[43] L. Elazary and L. Itti. Interesting objects are visually salient.
J. Vision, 2008. 3

[44] M.M. Cheng, G.X. Zhang, N.J. Mitra, X. Huang, and S.M. Hu. Global contrast
based salient region detection. CVPR, 2011. 2

[45]  M.  Cerf,  J.  Harel,  W.  Einhauser,  and  C.  Koch.  Predicting  gaze  using  low-level 
saliency  combined  with  face  detection.  NIPS,  2007.  3,  7 

[46] N. Ouerhani, R. von Wartburg, H. Hugli, and R.M. Muri. Empirical
validation of saliency-based model of visual attention. Electronic Letters on
Computer Vision and Image Analysis, 2003. 5

[47]  A.  Toet.  Computational  versus  psychophysical  image  saliency:  A  comparative 
evaluation  study.  IEEE  trans.  PAMI,  2011.  1 

[48]  N.  Parikh,  L.  Itti,  and  J.  Weiland.  Saliency-based  image  processing  for  retinal 
prostheses.  J.  Neural  Eng.  1,  2010.  1 

[49] L. Itti. Automatic foveation for video compression using a
neurobiological model of visual attention. IEEE Trans. Image Process., 2004. 1

[50] J. Wang, J. Sun, L. Quan, X. Tang, and H.Y. Shum. Picture collage. CVPR,
1:347-354, 2006. 1

[51]  R.  Rosenholtz,  A.  Dorai,  and  R.  Freeman.  Do  predictions  of  visual  perception 
aid  design?  ACM  Transactions  on  Applied  Perception  (TAP),  2011.  1 

[52]  D.  Walther  and  C.  Koch.  Modeling  attention  to  salient  proto-objects.  Neural 
Networks,  2006.  1,  7 

[53] D. DeCarlo and A. Santella. Stylization and abstraction of photographs.
ACM Trans. on Graphics, 2002. 1


Probabilistic  Learning  of  Task-Specific  Visual  Attention 


Ali  Borji  Dicky  N.  Sihite  Laurent  Itti 
Department of Computer Science, University of Southern California, Los Angeles

http://ilab.usc.edu


Abstract 

Despite a considerable amount of previous work on bottom-up saliency modeling
for predicting human fixations over static and dynamic stimuli, few studies
have thus far attempted to model top-down and task-driven influences of visual
attention. Here, taking advantage of the sequential nature of real-world tasks,
we propose a unified Bayesian approach for modeling task-driven visual
attention. Several sources of information, including global context of a scene,
previous attended locations, and previous motor actions, are integrated over
time to predict the next attended location. Recording eye movements while
subjects engage in 5 contemporary 2D and 3D video games, as modest counterparts
of everyday tasks, we show that our approach is able to predict human attention
and gaze better than the state-of-the-art, by a large margin (about 15%
increase in prediction accuracy). The advantage of our approach is that it is
automatic and applicable to arbitrary visual tasks.

1.  Introduction 

Visual attention is an important facet of our vision in everyday life. It makes
processing complex visual scenes tractable through sequential selection of
localized image regions. It is commonly believed that visual attention is
guided by two components: 1) a bottom-up (BU), task-independent, and
image-based component that instinctively draws the eyes to places in the scene
that contain discontinuities in image features, such as motion, color, and
texture, and 2) a top-down (TD) component that guides attention and gaze in a
task-dependent and goal-directed manner, orchestrating the sequential
acquisition of information from the visual environment. In everyday life, these
two components are combined in the control of gaze.

In computer vision, research on visual attention has been primarily focused on
the BU component. Early studies were directly influenced by cognitive studies
of visual search and Feature Integration Theory (FIT) [12]. This led Koch and
Ullman [13] to define the saliency map: a topographic map with retinotopic
organization where locations that stand out in an image (e.g., because of
distinctive features such as color, texture, and motion) are highlighted. The
first complete implementation and verification of this architecture was done by
Itti et al. [7]. Several other approaches for detecting image-based outliers
have also been proposed, based on information theory [14], discriminant
hypothesis [15], spectral models [16], sparse and efficient coding [17], and
Bayesian and graphical models [18][19][21].

Figure 1. Bottom-up saliency does not account for task-driven eye movements.
Predictions of 6 state-of-the-art BU saliency models in a driving scene and our
model (red box). Red diamond and blue circle show human fixation and maximum of
a model, respectively. See Table 2 for results.

Today, saliency detection and eye movement prediction over static images and
videos is a reasonably well-researched area and there are many models with good
accuracy, although improving the noise tolerance and invariance of algorithms
is always possible. However, the success of BU models is limited to a small
range of everyday tasks, such as free-viewing [ ][14][18] and their adaptation
to visual search [22][11]. BU models usually cannot predict exact fixation
points and leave more than half of them unaccounted for [25] (Fig. 1). One
problem with models based on saliency maps is that they are correlated with
fixation behaviors but do not tell much about the cause of such behaviors
[26][27]. Diverging from the current trend, we focus on modeling top-down
attention, which can boost the performance of several approaches in computer
vision, for example in areas such as object detection and recognition
[15][35][36], especially in the spatio-temporal domain for video understanding
and action/event recognition (e.g., [37][38]).

We aim to build an attentive vision system that can tell where it should look
as it moves through the world and interacts with the environment. This problem
is very important but very difficult and largely unsolved. Our approach is to
utilize global visual context [28][18], a low-dimensional representation of the
whole image (the "gist" of the scene). Such a representation can be easily
computed, and it relaxes the need to identify specific regions or segment and
recognize all objects in a scene. We focus on interactive environments
(contemporary 2D and 3D video games), where visual stimuli are dynamically
generated and affected by deliberative motor actions. We develop top-down
models trained over the same or similar games from data of subjects during game
playing, and we use those models to predict saccades of a new test subject.
Several sources of multimodal information, including global context, previous
saccades, and previous motor actions, are combined in a unified Bayesian
framework over time. Compared with brute-force algorithms such as the average
of all saccade positions, a central Gaussian blob, and 6 popular BU models, we
show that our models significantly outperform the state-of-the-art in terms of
accuracy at predicting where a new subject looks during active gameplay. This
indicates the effectiveness of our approach for modeling complex task-driven
attention.

Previous Work. The majority of studies on TD attention are at the
analysis/descriptive level and there are few computational models available, we
believe due to conceptual complexity. Yarbus [1] showed that "seeing" is
inextricably linked to the observer's cognitive goals. Task dependency of gaze
has been extensively studied for several real-world tasks, such as "sandwich
making" [4], "tea making" [2], and "driving" [5]. These studies have revealed
that most fixations are directed to task-relevant locations, and there is a
tight temporal relationship between fixations and task-related behaviors, such
that it is sometimes possible to infer the algorithm of a task from the pattern
of a subject's eye movements (e.g., in "block copying" [4]). In [3], Hayhoe and
Ballard elaborate on the role of internal reward in guiding eye and body
movements, supported by neurophysiological studies. Inspired by the idea of
visual routines [29] and using reinforcement learning (RL) approaches, Sprague
and Ballard [9] proposed an RL-based top-down attention model for explaining
eye movements of an agent operating in virtual environments. This approach is
interesting but suffers from three limitations that make it hard to apply
directly for computer vision purposes. First, it is limited to laboratory-scale
tasks such as sidewalk navigation [9]; second, visual processing is very
simple; and third, it needs explicit definitions of reward functions, subtasks,
and arbitration mechanisms.

Our approach is in part similar to the contextual model of Torralba et al.
[18], as we also use the concept of gist. We start with a basic Bayesian
formulation and add new features to account for task-driven attention in the
spatio-temporal domain, whereas the former has thus far been utilized for
bottom-up saliency and visual search over static stimuli. This model and its
descendants (e.g., [19, 11]) originally formulate object search as estimating
the probability $P(O = 1, X \mid L, G)$, where $X = (x, y)$ defines the
location of the target in the image, $O$ is a binary variable ($O = 1$ denotes
target presence and $O = 0$ denotes target absence in the image), and $L$ and
$G$ denote local and global features, respectively. According to Bayes'
theorem, they expand the above probability as:

$$
P(O = 1, X \mid L, G) = \frac{1}{P(L \mid G)}\, P(L \mid O = 1, X, G)\, P(X \mid O = 1, G)\, P(O = 1 \mid G)
$$

The first term on the right, $1 / P(L \mid G)$, is independent of the target,
measures BU saliency, and depends solely on local image features. The second
term represents top-down knowledge of target appearance: image regions with
features likely to belong to the target object are enhanced. The third term
provides context-based priors on the location of the target, and the fourth
term provides the prior probability of presence of the target in the scene. If
this probability is very small, then object search need not be initiated.
Zhang et al. [19] used Independent Component Analysis (ICA) and Difference of
Gaussians (DOG) features learned from a large repository of natural scenes to
estimate the first term. From another perspective, these models can be unified
with information-theoretic models (e.g., [14]), in the sense that both are
based on self-similarity of scene regions. These models assign higher saliency
values to regions with rare features. The information of a visual feature $F$
is $I(F) = -\log P(F)$, which decreases monotonically with the likelihood of
observing $F$. By fitting a distribution $P(F)$ to features, rare features can
be immediately found by computing $P(F)^{-1}$ in an image. The idea of global
context has also been extensively employed in several areas of computer vision
(e.g., [32][10]).
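The rarity idea above can be illustrated with a minimal sketch. It assumes scalar features in [0, 1] and a plain histogram estimate of $P(F)$; the function name `self_information`, the bin count, and the toy feature values are illustrative choices, not part of the models cited above (which fit distributions to ICA or DoG responses).

```python
import math
from collections import Counter

def self_information(features, num_bins=8, lo=0.0, hi=1.0):
    """Return I(F) = -log P(F) per feature, with P(F) estimated by a histogram.

    Assumption for illustration: features are scalars in [lo, hi].
    """
    width = (hi - lo) / num_bins
    # Map each feature value to a histogram bin, clamped to the valid range.
    bins = [min(num_bins - 1, max(0, int((f - lo) / width))) for f in features]
    counts = Counter(bins)
    total = len(features)
    return [-math.log(counts[b] / total) for b in bins]

# Nine patches share a common feature value; one patch is an outlier.
features = [0.1] * 9 + [0.9]
info = self_information(features)

# The rare feature carries the most information, so it is the most salient.
most_salient = max(range(len(features)), key=lambda i: info[i])
```

Here the outlier patch receives the largest value of $-\log P(F)$, which is the sense in which rare features are assigned higher saliency.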

Several other approaches have been proposed to model top-down attention,
specifically for visual search. Navalpakkam and Itti [22] proposed a Bayesian
approach to derive the optimal gains that should be applied to low-level visual
features contributing to a saliency model [7], to make an object of interest
more salient. The objective was to maximize the signal-to-noise ratio (SNR) of
the expected target object versus background clutter, and training was
performed over a set of natural scenes containing ground-truthed objects. An
intuitive solution to the same problem (optimal gains of feature channels) was
suggested earlier by Frintrop [23]; it coincides with the end result of the SNR
maximization process in [22]. Navalpakkam and Itti [8] proposed conceptual
guidelines for modeling the role of task in visual attention, but their method
requires the algorithm of the task to be known, and is not fully implemented.

Perhaps the most similar work to ours (i.e., real-world and unconstrained
tasks) is the work by Peters and Itti [6], where they used gist as a predictor
of fixation, learning from examples where people looked in scenes of different
gists while engaged in a particular task. The same scene gist, however, might
not always warrant the same eye movement, depending on the history and sequence
of previous fixations and actions to date. For example, in one of the games
studied here, even when looking at the exact same scene, eye movements are
often guided by past events, such as different customers placing different
orders for items which the player is asked to provide. To tackle the problem
that the gist of the scene is not enough, we follow a sequential processing
framework where several factors predictive of eye movements are integrated over
time and can resolve the confusion (aliasing) at one snapshot of time.

2.  Proposed  Model 

Our goal is to predict where a human subject attends under the task influence
T. This is similar to explaining saccades (jumps in eye movements) in
free-viewing, addressed by bottom-up models, with the difference that here a
policy governs saccades. Since it is difficult to learn general strategies for
performing every task, here we focus on learning models for each task
separately. Following a leave-n-out approach over subjects, first, in the
training phase, we compile a training set of feature vectors and eye positions
corresponding to individual frames from several video game clips which were
recorded while observers were playing video games. Then, the training data is
used to learn probability distributions over image locations for given feature
vectors, and these pdfs are later leveraged in the test phase for inferring the
next attended location of a new test subject.

We need a number of variables that cause or correlate with saccade positions
and hence can provide information regarding the next saccade location. These
variables tell us indirectly about the state of the agent at each time point of
the task. In addition to scene gist, we introduce two new features explained
below, motor actions and previous saccade position, and then combine all three
in a probabilistic manner over time to infer a probability distribution over
scene locations that may attract the next saccade:

Global context (Gist, G). Following a brief presentation of a photograph,
humans are able to summarize the quintessential characteristics of an image, a
process previously expected to require much analysis. A number of models exist
for calculating gist (e.g., [18][33]). We adopt the gist model of [10]¹ as it
is based on the bottom-up saliency model [7] that we use here as a baseline
approach. We consider 4 scales for each orientation pyramid, 6 scales for each
color pyramid, and 6 scales for intensity. For each of the maps, averages in
each of the patches of grids of size n x n (here $n \in \{1, 2, 4\}$) are
calculated (thus 1 + 4 + 16 = 21 values per map). Overall, the final gist
vector is the concatenation of (4 x 4 + 6 x 2 + 6 x 1) x 21 = 714 values. We
then employ PCA to reduce the dimensionality. We also investigate the ability
of histogram of oriented gradients (HOG) [30] features to represent the global
context of a scene².

¹ http://ilab.usc.edu/siagian/Research/Gist/Gist.html

² http://pascal.inrialpes.fr/soft/olt/
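The size of the gist vector can be checked with a small sketch of the grid-averaging step. This is only an illustration under stated assumptions: the feature maps are synthetic 8 x 8 arrays rather than real orientation, color, and intensity pyramids, the helper names `grid_means` and `gist_vector` are hypothetical, and reading "4 x 4 + 6 x 2 + 6 x 1" as 4 orientations, 2 color channels, and 1 intensity channel is an interpretation of the formula in the text.

```python
def grid_means(feature_map, n):
    """Average a 2D map (list of rows) over an n x n grid of patches."""
    h, w = len(feature_map), len(feature_map[0])
    means = []
    for gy in range(n):
        for gx in range(n):
            y0, y1 = gy * h // n, (gy + 1) * h // n
            x0, x1 = gx * w // n, (gx + 1) * w // n
            patch = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            means.append(sum(patch) / len(patch))
    return means

def gist_vector(feature_maps, grid_sizes=(1, 2, 4)):
    """Concatenate grid averages of every map: 1 + 4 + 16 = 21 values per map."""
    vec = []
    for fmap in feature_maps:
        for n in grid_sizes:
            vec.extend(grid_means(fmap, n))
    return vec

# Assumed reading of the formula: 4 orientations x 4 scales,
# 2 color channels x 6 scales, 1 intensity channel x 6 scales.
num_maps = 4 * 4 + 6 * 2 + 6 * 1

# Synthetic stand-ins for the real feature maps.
maps = [[[float(k + x + y) for x in range(8)] for y in range(8)]
        for k in range(num_maps)]
gist = gist_vector(maps)
```

With 34 maps and 21 averages per map, `len(gist)` matches the 714 values stated above; the PCA step that follows is not shown.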


Previous saccade location (X). Many everyday tasks require a number of
perceptions and actions to be performed in a sequence (e.g., sandwich making
[4]). Therefore, knowing which object has been attended previously provides
evidence for the next attended object. We implement this idea over spatial
locations. For instance, $P(X_{t+1} = b \mid X_t = a)$ indicates the
probability of looking at location b in the next time step given that location
a is currently fixated (e.g., looking left first and then right, when turning
right).
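A transition table of this kind could be estimated by counting consecutive saccade locations in the training sequences. The sketch below is one simple way to do so, assuming 0-based location indices; the function name `transition_matrix` and the additive smoothing constant `alpha` are assumptions made for illustration and are not taken from the paper.

```python
def transition_matrix(sequences, n, alpha=1.0):
    """Estimate P(X_{t+1} = b | X_t = a) by counting, with additive smoothing.

    sequences: lists of 0-based location indices in temporal order.
    alpha: smoothing constant (an assumption; keeps unseen transitions nonzero).
    """
    counts = [[alpha] * n for _ in range(n)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    # Normalize each row so that it is a distribution over next locations.
    return [[c / sum(row) for c in row] for row in counts]

# Toy saccade sequences over 3 locations.
sequences = [[0, 1, 2, 1, 2], [0, 1, 2]]
P = transition_matrix(sequences, n=3)
```

Each row `P[a]` is then a distribution over the next location given that location `a` is currently fixated.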

Motor actions (A). Actions and fixations are tightly linked; thus, by knowing a
performed action, one can tell where to look next. We recorded motor actions
while humans were involved in game playing. We assumed that these actions
correspond to some high-level events in the game (e.g., mouse click for
shooting). We logged actions for driving games (e.g., wheel position, pedals
(brake and gas), left and right signals, mirrors, left and right side views,
and gear change), from which we only generated a 2D feature vector from wheel
and pedal positions between 0 and 255 (Fig. 2). For other games, 2D mouse
position and joystick buttons were used (further explained in Sec. 3.1).

2.1.  Problem  Formulation 

In this section, we describe the details of our Bayesian approach to
integrating information over time to predict the saccade in the next time step.
Our method is based on Hidden Markov Models (HMM), which are successful
probabilistic tools for sequence processing. We are particularly interested in
the probability of attending to spatial location $X$ given all available
information $I$, or $P(X \mid I)$. One way to estimate this probability is to
follow a discriminative approach by concatenating all information into a large
vector $I$, and using a classifier to map it to $X$ from a set of labeled
training data. An alternative is to follow a Bayesian formulation:
$P(X \mid I) = P(I \mid X) P(X) / P(I) = \mu P(I \mid X) P(X)$. The parameter
$\mu$ is selected so that the resulting probabilities sum to 1 (i.e.,
$\sum_j P(X_j \mid I) = 1$). $P(X)$ is simply the prior distribution of all
saccade locations in the training data (sum of all saccades or average fixation
map). A benefit of the generative approach over the discriminative
classifier-based approach is that it provides a unified method for integrating
information from sequential data, which makes it suitable for our purpose and
improves results.

Formally, the goal of saccade prediction is to compute a probability
distribution over the possible locations given all features up to time t. Let
$X_t \in \{1, \dots, n\}$ denote the saccade location at time t, with n the
number of locations in the image. To generate sufficient data, we resize the
original eye fixation map, with one at the attended location and zeros
elsewhere, into a smaller-scale map (a w x h grid). Therefore, $X_t$ is the
location of the 1 in such a map. In the following, we start with the simplest
case of $P(X \mid I)$, when only global context information is available (i.e.,
$I$ is equal to Gist), and add more information in subsequent steps.
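As an illustration of this discretization, the sketch below maps a pixel-level fixation to a 1-based cell index on a coarse grid. The function name `fixation_to_cell`, the row-major ordering of cells, and the 640 x 480 frame with a 20 x 15 grid used in the example are hypothetical choices made for the sketch.

```python
def fixation_to_cell(x, y, width, height, grid_w, grid_h):
    """Map a pixel fixation (x, y) to a 1-based index X_t in {1, ..., n}.

    n = grid_w * grid_h; cells are numbered row by row (an assumed ordering).
    """
    col = min(grid_w - 1, int(x * grid_w / width))
    row = min(grid_h - 1, int(y * grid_h / height))
    return row * grid_w + col + 1

# Hypothetical 640 x 480 frame mapped onto a 20 x 15 grid (n = 300).
center_cell = fixation_to_cell(320, 240, 640, 480, 20, 15)
first_cell = fixation_to_cell(0, 0, 640, 480, 20, 15)
last_cell = fixation_to_cell(639, 479, 640, 480, 20, 15)
```

Pooling fixations into cells in this way means that each location index collects more training samples than a single pixel would.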


Case 1: Gist only. In this case, only global context information from all past
times and the current time is used. According to Bayes' theorem we have:

$$
\begin{aligned}
P(X_t \mid G_{1:t}) &= P(X_t \mid G_t, G_{1:t-1}) \\
&= \frac{P(G_t \mid X_t)\, P(X_t \mid G_{1:t-1})}{P(G_t \mid G_{1:t-1})} \\
&= \mu\, P(G_t \mid X_t)\, P(X_t \mid G_{1:t-1})
\end{aligned}
$$

Following the Markov assumption, the current scene gist ($G_t$) has all the
necessary information for determining the state and knowing the attended
location. Thus $X_t$ is independent of all previous gists:
$P(X_t \mid G_{1:t-1}) = P(X_t)$. Therefore, we can write
$P(X_t \mid G_{1:t}) = \mu\, P(G_t \mid X_t)\, P(X_t)$, with $P(X_t)$ as the
prior distribution over eye positions.

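Case 2: Gist and previous saccade. In the second step, we add the
previous saccade locations to the formulation:

$$
\begin{aligned}
P(X_t \mid G_{1:t}, X_{1:t-1})
&= P(X_t \mid G_t, G_{1:t-1}, X_{1:t-1}) \\
&= \frac{P(G_t \mid X_t)\, P(X_t \mid G_{1:t-1}, X_{1:t-1})}{P(G_t \mid G_{1:t-1}, X_{1:t-1})} \\
&= \mu_1\, P(G_t \mid X_t)\, P(X_t \mid G_{1:t-1}, X_{1:t-1}) \\
&= \mu_1 \mu_2\, P(G_t \mid X_t)\, P(X_{t-1} \mid X_t)\, P(X_t \mid G_{1:t-1}, X_{1:t-2})
\end{aligned}
$$

where $\mu_1$ is equal to $P(G_t \mid G_{1:t-1}, X_{1:t-1})^{-1}$ and
$\mu_2$ is $P(X_{t-1} \mid G_{1:t-1}, X_{1:t-2})^{-1}$. Again,
considering the Markov assumption and defining $\mu = \mu_1 \mu_2$,
we have:

$$
P(X_t \mid G_{1:t}, X_{1:t-1}) = \mu\, P(G_t \mid X_t)\, P(X_{t-1} \mid X_t)\, P(X_t).
$$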
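
The step that introduces $P(X_{t-1} \mid X_t)$ in Case 2 is a second application of Bayes' rule, this time to $X_{t-1}$. A sketch of that intermediate step, under the assumption (implied by the result but not stated explicitly in the text) that $X_{t-1}$ depends on the earlier history only through $X_t$:

$$
\begin{aligned}
P(X_t \mid G_{1:t-1}, X_{1:t-1})
&= \frac{P(X_{t-1} \mid X_t, G_{1:t-1}, X_{1:t-2})\, P(X_t \mid G_{1:t-1}, X_{1:t-2})}{P(X_{t-1} \mid G_{1:t-1}, X_{1:t-2})} \\
&= \mu_2\, P(X_{t-1} \mid X_t)\, P(X_t \mid G_{1:t-1}, X_{1:t-2}),
\qquad \mu_2 = P(X_{t-1} \mid G_{1:t-1}, X_{1:t-2})^{-1}.
\end{aligned}
$$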

Case 3: Gist, previous saccade, and motor actions.

Finally, we combine all evidence in our Bayesian model. Following the
steps in case 2 and simplifying, we arrive at:

$$
P(X_t \mid G_{1:t}, X_{1:t-1}, A^{j=1:N}_{1:t-1})
= \mu\, P(G_t \mid X_t)\, P(X_{t-1} \mid X_t)\, P(X_t) \times \prod_{j=1}^{N} P(A^j_{t-1} \mid X_t)
\tag{4}
$$

The above formula assumes that actions are independent of each other
given the attended location (i.e., $A^i \perp A^j \mid X$). An
important point here is whether actions influence saccades or vice
versa. In the real world the interaction works both ways: for some
situations/tasks saccades lead actions, but sometimes actions can
also lead eye movements. Here, to be on the safe side, we did not use
the current action.
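
As an illustration of how the Case 3 posterior can be evaluated once the individual terms are available, here is a minimal Python sketch. The function name `posterior` and its argument layout are assumptions made for this example, not the authors' implementation; each argument is a list indexed by candidate location.

```python
def posterior(prior, gist_lik, prev_sacc_lik, action_liks):
    """Combine evidence for every candidate location x (hypothetical helper).

    prior[x]          ~ P(X_t = x)
    gist_lik[x]       ~ P(G_t | X_t = x)
    prev_sacc_lik[x]  ~ P(X_{t-1} | X_t = x)
    action_liks[j][x] ~ P(A^j_{t-1} | X_t = x), one list per action channel
    """
    scores = []
    for x in range(len(prior)):
        s = prior[x] * gist_lik[x] * prev_sacc_lik[x]
        for lik in action_liks:  # actions assumed independent given X_t
            s *= lik[x]
        scores.append(s)
    total = sum(scores)
    if total == 0.0:
        raise ValueError("no location has non-zero support")
    # Dividing by the total plays the role of the normalizer mu.
    return [s / total for s in scores]


post = posterior(
    prior=[0.5, 0.3, 0.2],
    gist_lik=[0.1, 0.4, 0.5],
    prev_sacc_lik=[0.2, 0.2, 0.6],
    action_liks=[[0.5, 0.5, 0.5]],
)
best_location = max(range(len(post)), key=lambda x: post[x])
```

With many terms the product can underflow, so a practical version would likely sum log-probabilities instead; the plain product is kept here to mirror the formula.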

Computing (4) requires estimating $P(G_t \mid X_t)$ and, similarly,
the other terms. This can be done in several ways using probability
density estimation techniques such as a generalized Gaussian model,
histogram estimation, or kNNs. We adopted the Kernel Density
Estimation (KDE) approach. One pdf is calculated for each spatial
location:

$$
P(G \mid X) = \frac{1}{m} \sum_{i=1}^{m} G_h(x - x_i)
= \frac{1}{mh} \sum_{i=1}^{m} G\!\left(\frac{x - x_i}{h}\right)
\tag{5}
$$

where $G_h$ is a Gaussian kernel with smoothing parameter (sliding
window or bandwidth) $h$ and $m$ is the number of data points. We
used a Matlab toolbox for implementing KDE (publicly available at
http://www.ics.uci.edu/~ihler/code/kde.html).
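
To make the estimator concrete, here is a small pure-Python sketch of a Gaussian KDE. It is not the Matlab toolbox used in the paper; the function name `gaussian_kde` and the choice of an isotropic kernel (one shared bandwidth `h` for every feature dimension) are assumptions made for illustration.

```python
import math


def gaussian_kde(query, samples, h):
    """Density at `query` estimated from `samples` with bandwidth h.

    `query` is one feature vector and `samples` a list of vectors of the
    same length. An isotropic Gaussian kernel is assumed here.
    """
    if not samples:
        raise ValueError("need at least one sample")
    d = len(query)
    norm = (math.sqrt(2.0 * math.pi) * h) ** d  # kernel normalizer
    total = 0.0
    for s in samples:
        sq_dist = sum((q - v) ** 2 for q, v in zip(query, s))
        total += math.exp(-0.5 * sq_dist / (h * h))
    return total / (len(samples) * norm)


# One density per grid location: the samples are the (reduced) Gist vectors
# of training frames whose saccade landed on that location (toy values).
samples_at_location = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
near = gaussian_kde([0.1, 0.1], samples_at_location, h=0.5)
far = gaussian_kde([3.0, 3.0], samples_at_location, h=0.5)
```

Evaluating the same query against the sample set of every location gives the likelihood term for each candidate location.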


2.2.  Baseline  Benchmark  Models 

To fully evaluate the effectiveness of our model, we implemented the
regression model put forward by Peters and Itti [6] as well as a
nearest-neighbor classifier and two other brute-force yet powerful
models.

Linear Regression (REG). This model does not take into account the
temporal progress of a task and simply maps the Gist of the scene to
the eye position. Mathematically, the goal is to optimize the
following objective function:

$$
\arg\min_{W} \; \| M \times W - X \|^2
\quad \text{subject to: } W \geq 0
\tag{6}
$$

where $M$ indicates the matrix of feature vectors (only the Gist
feature is used in [6]) and $X$ is the matrix of eye positions (one
fixation per frame). The least-squares solution of the above
objective function is $W = M^{+} \times X$, where $M^{+}$ is the
pseudo-inverse of matrix $M$ obtained through SVD. In our
experiments, we keep only the largest singular value of the SVD,
since this avoids numerical instability and results in higher
accuracy. Given vector $E = (u, v)$ as the eye position over a
$20 \times 15$ map (i.e., $w = 20$, $h = 15$) with $u \in [1, 20]$
and $v \in [1, 15]$, the gaze density map can then be represented by
vector $X = [x_1, x_2, \ldots, x_{300}]$ with $x_i = 1$ for
$i = u + (v - 1) \times 20$ and $x_i = 0$ otherwise. Finally, for
each test frame, we compute the feature vector $F$ and generate the
predicted map $P = F \times W$, which is then reshaped to a
$20 \times 15$ saliency map. The maximum of this map is used to
direct spatial attention.
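
The mapping between an eye position on the 20 x 15 grid and the 300-element one-hot vector can be sketched as follows. The helper names `encode` and `decode_argmax` are hypothetical; the sketch assumes the 1-based, row-major indexing implied by the text.

```python
W, H = 20, 15  # grid width and height from the text


def encode(u, v, w=W, h=H):
    """One-hot vector of length w*h with a 1 at i = u + (v - 1) * w (1-based)."""
    if not (1 <= u <= w and 1 <= v <= h):
        raise ValueError("eye position outside the grid")
    x = [0] * (w * h)
    x[u + (v - 1) * w - 1] = 1  # shift to Python's 0-based indexing
    return x


def decode_argmax(p, w=W):
    """Return the 1-based (u, v) of the maximum of a flattened map p."""
    i = max(range(len(p)), key=lambda k: p[k])  # 0-based flat index
    return i % w + 1, i // w + 1


x = encode(7, 3)
peak = decode_argmax(x)
```

`decode_argmax` applies equally to a predicted map once it has been flattened, which is how the maximum would be read out to direct attention.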

k Nearest Neighbor Classifier (kNN). We also implemented a non-linear
mapping from features to saccade locations. The attention map for a
test frame is built from the distribution of fixations of its most
similar frames in the training set. For each test frame, the $k$ most
similar frames (using the Euclidean distance) were found, and the
predicted map was the weighted average of the fixation locations of
these frames (i.e.,
$X^i = \frac{1}{k} \sum_{j=1}^{k} D(F^i, F^j)^{-1} X^j$), where $X^j$
is the fixation map of the $j$-th most similar frame to frame $i$,
weighted according to its similarity to frame $i$ in feature space
(i.e., $D(F^i, F^j)^{-1}$). We chose parameter $k$ to be 10, which
resulted in good performance over the training data as well as
reasonable speed.
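
A minimal sketch of this inverse-distance-weighted kNN predictor is given below. The function name `knn_map` and the small constant `eps` (added to avoid division by zero when a training frame has identical features) are assumptions of this example.

```python
import math


def knn_map(test_feat, train_feats, train_maps, k=10, eps=1e-9):
    """Predict a flattened fixation map for one test frame (illustrative)."""
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(test_feat, f)))
        for f in train_feats
    ]
    # Indices of the k training frames closest in feature space.
    nearest = sorted(range(len(dists)), key=lambda j: dists[j])[:k]
    out = [0.0] * len(train_maps[0])
    for j in nearest:
        weight = 1.0 / (dists[j] + eps)  # similarity = inverse distance
        for i, value in enumerate(train_maps[j]):
            out[i] += weight * value
    return [v / len(nearest) for v in out]


train_feats = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
train_maps = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pred = knn_map([0.1, 0.0], train_feats, train_maps, k=2)
```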

In addition to the above, we also devised two brute-force yet
powerful predictors. The first one is simply the average of all
saccade positions during the time course of a task over all $m$
training frames, which we call the Average Fixation Map (AFM) (i.e.,
$AFM = \frac{1}{m} \sum_{j=1}^{m} X^j$). In the dynamic environments
used in this paper, since frames are generated on the fly and there
are few fixations per frame, aligning frames (contrary to movies) is
not possible. If a method could dynamically predict eye movements on
a frame-by-frame basis, then achieving a higher accuracy than AFM is
possible. The AFM is also the solution of the regression with a
constant input, and is the output of our Bayesian model with one
variable ($P(X)$ only). The second predictor is a central Gaussian
filter (Gauss). The rationale behind using this model is that humans
tend to look at the center of the screen when playing games
(center-bias or photographer-bias issue [31], by game design
construction); therefore a central Gaussian blob may score well when
datasets are centrally biased (see Figs. 3 and 6). Instead of using a
fixed-size Gaussian for all games, we fitted a bivariate (2D)
Gaussian to the fixation data of each game using maximum likelihood
(ML):

$$
f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}}
\exp\!\left( -\frac{z}{2(1 - \rho^2)} \right)
\tag{7}
$$

where
$z = \frac{(x - \mu_x)^2}{\sigma_x^2}
- \frac{2\rho (x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y}
+ \frac{(y - \mu_y)^2}{\sigma_y^2}$
and $\rho$ is the correlation coefficient between $x$ and $y$ (i.e.,
$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$), where
$\Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$
is the covariance matrix.

Figure 2. Correlation between actions and saccade positions. Rows
indicate events (each frame was manually tagged based on its event).
Columns from left to right include: wheel vs. eye-x, eye-y vs. wheel,
saccade coordinates during the game (eye-x vs. eye-y), and frequency
of pedal positions for the DS game. Blue ellipses in the 3rd column
indicate objects in the scene (see Fig. 1). Similar trends happen in
the other games, which eventually could help us in predicting the
next saccade location.
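
The maximum-likelihood fit of the Gaussian baseline and the evaluation of the density in (7) can be sketched in a few lines. The function names `fit_gaussian` and `gaussian_pdf` are made up for this example, and the toy fixation list is not data from the paper.

```python
import math


def fit_gaussian(points):
    """ML estimates (mean and covariance entries) from (x, y) fixations."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n  # ML uses 1/n, not 1/(n-1)
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return mx, my, sxx, syy, sxy


def gaussian_pdf(x, y, mx, my, sxx, syy, sxy):
    """Bivariate Gaussian density with variances sxx, syy and covariance sxy."""
    sx, sy = math.sqrt(sxx), math.sqrt(syy)
    rho = sxy / (sx * sy)
    z = ((x - mx) ** 2 / sxx
         - 2.0 * rho * (x - mx) * (y - my) / (sx * sy)
         + (y - my) ** 2 / syy)
    return math.exp(-z / (2.0 * (1.0 - rho ** 2))) / (
        2.0 * math.pi * sx * sy * math.sqrt(1.0 - rho ** 2))


params = fit_gaussian([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
at_mean = gaussian_pdf(1.0, 1.0, *params)
off_center = gaussian_pdf(2.0, 2.0, *params)
```

Evaluating `gaussian_pdf` at every cell of the grid yields the static map that is compared against the other models.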


3.  Quantitative  Results 

Here we report results of our approach for predicting saccades (jumps
in eye movements to bring the relevant object/location to the fovea).
Thresholds to detect saccades were set to a velocity of 20°/s and an
amplitude threshold of 2°/s. While we only process those frames in
which a saccade happened, our method is easily applicable to
predicting fixations (one for each frame).


3.1. Eye Tracking and Data Gathering

To test our models, we have collected a large amount of multi-modal
data from subjects playing video games. We intend to share our data
and accompanying software to encourage follow-up research on modeling
top-down attention.

Human subjects in the age range 20 to 30 played 5 video games.
Subjects were students of an anonymous university. Some subjects
played more than one game. First, in a 5 min training session, the
aim and rules of the game as well as the buttons of the playing
device were explained to the subject. Subjects were then asked to
play the game to become familiar with the gaming environment. After
training, in a test session, subjects played a different scenario of
the game than during training (e.g., a different game level) without
the experimenter's intervention. Each subject therefore experienced a
different course of events in the game. Before the test session
started, the eye tracker (ISCAN Inc. RK-464) was calibrated using a
9-point calibration scheme. The subject's head was placed on a
chin-rest at a distance of 130 cm from the screen, yielding a visual
field of 43° x 25°. The subject's right eye was recorded. Along with
frames and fixations, the subject's actions were also logged. A
computer with Windows OS ran the PC games (frame rate 30 Hz), logged
actions (frequency 62 Hz), and sent frames to a computer with Linux
Mandriva OS that displayed and saved frames for later analysis.
Another Windows machine controlled the eye tracker camera and
recorded fixations (240 Hz). All computers communicated via a LAN and
their clocks were synchronized. Each data item had a time stamp,
which allowed us to align frame, action, and fixation data after
recording.
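
The text does not say how the three time-stamped streams were aligned. One simple possibility, shown purely as an assumption, is to pair each frame with the action and fixation samples whose time stamps are nearest. The helper names below are hypothetical.

```python
import bisect


def nearest_index(timestamps, t):
    """Index of the entry in the sorted list `timestamps` closest to t."""
    i = bisect.bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose between the neighbors on either side of t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(frame_times, action_times, fixation_times):
    """For each frame, return (action index, fixation index) nearest in time."""
    return [
        (nearest_index(action_times, t), nearest_index(fixation_times, t))
        for t in frame_times
    ]


pairs = align([0.0, 1.0], [0.0, 0.4, 0.9], [0.0, 1.1])
```

Because the streams run at different rates (30, 62 and 240 Hz in the setup above), several action or fixation samples fall between consecutive frames; nearest-neighbor matching keeps one of each per frame.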

Stimuli. To evaluate the power of our model, we applied it to 5 games
with different task algorithms and visual renderings. For some games
the scenes change considerably, while for others the background scene
is nearly constant, making gist features less variable and less
informative.

Two of the games are driving games. The first one, 3D Driving School
(DS), is a driving emulator with simulated traffic conditions.
Players must follow the route and European traffic rules defined by
the game. An instructor tells the players where to go through text in
a semi-translucent box at the top of the screen and/or a small arrow
in the top-left corner. Players use automatic transmission to drive
around the entire course. This game has only a dashboard view, an
inside view from the driver's side towards the road. The second
driving game, 18 Wheels of Steel (WS), is a semi-truck simulator. In
this game, players drive a big rig to a specific destination to
receive money rewards for delivering a trailer. Players must drive
carefully, as the truck cannot accelerate or brake suddenly due to
its mass. In this game, players were told to always make a left turn,
since there is no explicit instruction on the screen telling them
where to go. Players also used a first-person/bumper view.
Correlations between fixation patterns and driving events were found
that can help in detecting drivers' behavior and intentions (Fig. 2).
Fig. 3 shows the average fixation map for the DS game and its
corresponding fitted Gaussian model.

Figure 3. Average fixation map for the DS game and its corresponding
learned Gaussian map, $\mu = [8.75\ \ 10.5]$ and
$\Sigma = [7.85\ \ 0.52;\ 0.52\ \ 14.5]$. Panels: Average Fixation
Map (AFM) - DS game; Fitted 2D Gaussian pdf (Gauss).

Game  # Sacc.  # Subj  Dur.       # Frames  Size (GB)  Action
DS    6382     10      10 min     180K      110        J
WS    4849     10      10 min     180K      110        J
SM    1482     5       5 min      45K       26         J
BS    1763     5       5 min      45K       26         M
TG    4602     12      ~4.5 min   99K       57         N/A

Table 1. Summary statistics of our data including overall number of
saccades, subjects, durations per subject, frames, sizes in GB, and
action types (J indicates joystick and M stands for mouse).

The third game, Super Mario Bros (SM), is a classic 2D side-scrolling
action game. Players guide Mario to a flagpole to finish the level.
Mario grows bigger if he consumes a mushroom and can shoot fireballs
if he consumes a flower. There are various enemies that can be killed
by stomping on them or shooting fireballs. In this game, players were
expected not to take any shortcuts such as running on the ceiling,
teleport pipes, or warp points. Actions in this game are the (x, y)
position of the joystick ([0, 255] for left/right and up/down) and
the status of 3 binary buttons: Start, Jump, and Fire/Run.

The fourth game, Burger Shop (BS), is a 2D time-management game.
Under time pressure, players serve customers who order food items
such as burgers and fries that must be assembled from a conveyor belt
that brings ingredients. The game ends when all customers are served.
For this game, actions include the mouse (x, y) position as well as
the status of the mouse buttons (i.e., Left, Middle, and Right).

The fifth game, Top Gun (TG), is a flight-combat simulator. Players
control a jet-fighter plane that can lock onto targets and shoot
missiles, use afterburners to speed up, and perform air maneuvers.
The main objective of the game is to completely destroy all targets
in the air and on the ground. Players use a first-person view in this
game. Currently, we do not have motor actions for this game.

Table 1 shows summary statistics of the video game data.


3.2.  Evaluation  Metrics 

To quantify how well model predictions matched observers' actual eye
positions, we used two metrics:

Normalized Scanpath Saliency (NSS). NSS [34] is defined as the
response value at the human eye position $(x_h, y_h)$ in a model's
predicted gaze density map that has been normalized to have zero mean
and unit standard deviation:
$NSS(t) = \frac{1}{\sigma_{s(t)}} \left( s(x_h(t), y_h(t)) - \mu_{s(t)} \right)$
for the frame at time $t$. An NSS value of unity indicates that the
subject's eye position falls on a region whose predicted density is
one standard deviation above average. Meanwhile, an NSS value of zero
or lower means that the model performs no better than picking a
random position on the map.

Figure 4. Grid search for best parameters (KDE kernel width and PCA
dimensions of the Gist vector; Sec. 2) for DS game over train data.
Axis label: number of Gist vector dimensions preserved through PCA.

Figure 5. Prediction accuracy of our KDE model, Itti et al [ ],
classifiers also implemented here, as well as brute-force predictors
(AFM and Gaussian) for 5 video games using NSS and AUC (ROC) scores.
The KDE model with all features, KDE (All), results in the best
performance in all cases, and KDE with only the Gist feature
outperforms the other compared models.
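
For concreteness, NSS for a single frame can be computed as in the sketch below. The function name `nss` and the convention of addressing the map as `saliency[row][col]` are assumptions of this example.

```python
import math


def nss(saliency, fixation):
    """Normalized Scanpath Saliency for one frame.

    `saliency` is a 2D list (rows of values); `fixation` is the (row, col)
    of the human eye position on that map.
    """
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return 0.0  # a flat map carries no information about gaze
    r, c = fixation
    return (saliency[r][c] - mean) / std


toy_map = [[0.0, 0.0], [0.0, 4.0]]
hit = nss(toy_map, (1, 1))   # eye on the peak of the map
miss = nss(toy_map, (0, 0))  # eye on a low-valued cell
```

Averaging this value over all evaluated frames would give a per-game score of the kind reported in the tables.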

Area Under the Curve (AUC). Here, a model's saliency map is treated
as a binary classifier on every pixel in the image; pixels with
saliency values larger than a threshold are classified as fixated,
while the rest of the pixels are classified as non-fixated. Human
fixations are used as ground truth. By varying the threshold, the
Receiver Operating Characteristic (ROC) curve is drawn as the false
positive rate vs. the true positive rate, and the area under this
curve indicates how well the saliency map predicts actual human eye
fixations [14]. Perfect prediction corresponds to a score of 1.
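
The threshold sweep described above can be sketched as follows. The function name `auc` is hypothetical, the map is passed as a flat list, and a pixel is counted as predicted-fixated when its value is greater than or equal to the threshold (a small deviation from the strict inequality in the text, so that the curve ends at the point (1, 1)).

```python
def auc(saliency, fixated):
    """Area under the ROC curve for one map.

    `saliency` is a flat list of map values and `fixated` the set of indices
    that humans fixated (ground truth positives).
    """
    pos = [s for i, s in enumerate(saliency) if i in fixated]
    neg = [s for i, s in enumerate(saliency) if i not in fixated]
    if not pos or not neg:
        raise ValueError("need both fixated and non-fixated pixels")
    points = [(0.0, 0.0)]
    for th in sorted(set(saliency), reverse=True):
        tpr = sum(s >= th for s in pos) / len(pos)
        fpr = sum(s >= th for s in neg) / len(neg)
        points.append((fpr, tpr))
    # Trapezoidal integration of TPR over FPR.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


perfect = auc([0.9, 0.1, 0.2, 0.3], {0})
worst = auc([0.9, 0.1, 0.2, 0.3], {1})
chance = auc([0.5, 0.5, 0.5, 0.5], {0})
```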

Results. In the first experiment, we trained the model over each game
separately. Each game segment has a variable number of saccades for
each subject. Training was done over the saccades of $K - 1$ subjects
and testing over the saccades of the remaining subject. In each
training phase, the best kernel width and number of PCA dimensions of
the gist vector (see Sec. 2) were found using grid search. Fig. 4
shows an example of best parameters over one training session of the
DS game. Fig. 5 shows NSS and AUC scores, as well as ROC curves, for
baseline models and all variants of our model for each individual
game. Over all games, KDE with all features (case 3) resulted in the
best performance, followed by case 2: KDE (Gist + Prev. sacc). KDE
with only the Gist feature outperformed classifiers with Gist, which
indicates an advantage of the KDE approach for using this feature
compared with regression [6]. A random predictor (a random value for
each location) has zero NSS and AUC near 0.5. The AFM predictor
achieved higher scores than the BU [ ] and Gaussian models over all
games, indicating that eye movements were likely mostly guided
top-down and BU influences were weak. AFM outperformed classifiers
over the SM game, indicating that Gist is not a good predictor for
this game; but when we added previous saccade position and actions,
classifiers and KDE performed the best. Using action features alone
in the kNN classifier resulted in NSS values of 1.41 and 1.80 for the
DS and WS games, respectively, higher than Gaussian and close to AFM
of each game.

Game       ICL   SDSR  GBVS   AIM   SUN    Gauss  AFM   KDE    KDE    KDE
           [17]  [20]  [24]   [14]  [19]   [31]         (C-1)  (C-2)  (C-3)
DS   AUC   0.57  0.54  0.73   0.62  0.658  0.76   0.78  0.82   0.82   0.82
     NSS   0.19  0.05  0.948  0.54  0.30   1.47   1.66  1.9    1.91   1.95
WS   AUC   0.52  0.41  0.73   0.55  0.51   0.76   0.81  0.83   0.83   0.84
     NSS   0.27  -0.2  1.25   0.66  0.19   1.64   1.9   2.18   2.21   2.46
SM   AUC   0.61  0.69  0.72   0.67  0.62   0.67   0.75  0.78   0.79   0.79
     NSS   0.59  0.74  1.21   0.77  0.33   0.62   1.07  1.13   1.21   1.11
BS   AUC   0.72  0.61  0.73   0.69  0.72   0.72   0.76  0.79   0.81   0.84
     NSS   1.04  0.54  1.1    0.80  1.2    0.96   1.89  2.1    2.2    2.7
TG   AUC   0.62  0.5   0.622  0.6   0.6    0.6    0.73  0.75   0.75   -
     NSS   0.58  0.01  0.55   0.51  0.29   0.57   1.28  1.36   1.34   -

Table 2. AUC (1st rows) and NSS scores (2nd rows) of 5
state-of-the-art models and ours over our data. Numbers in bold show
the best two models in each row. In almost all cases, while other
models fall below the Gaussian and AFM models, KDE (All) scores the
best. In some cases, regression and kNN may score the best (cf.
Fig. 5). C-x stands for Case x (see Sec. 2.1).
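
The leave-one-subject-out protocol with an inner grid search can be outlined as below. This is a schematic sketch only: the helper names and the `score` callback (which would train a KDE with the given kernel width and number of PCA dimensions and return, for example, NSS on the training data) are assumptions, and the toy scoring function stands in for real training.

```python
def leave_one_subject_out(subjects):
    """Yield (training subjects, held-out test subject) for every subject."""
    for i, test in enumerate(subjects):
        yield subjects[:i] + subjects[i + 1:], test


def grid_search(train, bandwidths, pca_dims, score):
    """Return the (bandwidth, dims) pair maximizing score(train, h, d)."""
    candidates = [(h, d) for h in bandwidths for d in pca_dims]
    return max(candidates, key=lambda p: score(train, p[0], p[1]))


def toy_score(train, h, d):
    # Placeholder objective peaking at h = 0.5, d = 20 (not real data).
    return -((h - 0.5) ** 2) - ((d - 20) / 10.0) ** 2


folds = list(leave_one_subject_out(["s1", "s2", "s3"]))
best = grid_search(folds[0][0], [0.1, 0.5, 1.0], [10, 20, 40], toy_score)
```

In a full evaluation, the parameters selected on each training fold would then be used to score the held-out subject, and the per-subject scores averaged.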

Table 2 shows the accuracy of 5 state-of-the-art bottom-up saliency
models, Gaussian, and AFM. In previous research, these models have
achieved the highest scores over eye movement datasets for the
free-viewing task. Here, almost all of these models perform worse
than AFM, while our approaches (KDE (All) and KDE (Gist)) score
higher by a large margin. This again indicates that, while bottom-up
saliency models fail to account for eye fixations in our tasks, which
have a strong top-down component, our new models are able to capture
a large share of task-driven saccades.


In the second experiment, we trained the KDE models over one of the
two driving games and tested them on the other to assess the
generalization power of our approach over different tasks. As Table 3
shows, training on a similar game results in higher accuracy than
random, and close to the performance of the Gaussian and AFM
predictors of each game shown in Table 2. Applying the AFM of each
game to the other resulted in higher accuracy than the KDE models,
probably because one constant in both games is that subjects look at
the center. Since actions and sequences of fixations are specific to
each game, adding them slightly drops the performance (KDE (Gist) vs.
KDE (All)).

Train on      DS            WS
Test on       WS            DS
AFM           0.80 (1.74)   0.75 (1.51)
KDE (Gist)    0.80 (1.64)   0.74 (1.40)
KDE (All)     0.79 (1.62)   0.73 (1.51)

Table 3. Confusion matrix of training models on one driving game and
applying them to the other one, using AUC and NSS (in parentheses).

        Gist [10]                    HOG [30]
Game    kNN           REG            kNN           REG
DS      0.80 (1.77)   0.8 (1.86)     0.81 (1.88)   0.81 (2.05)
SM      0.75 (0.88)   0.76 (1.01)    0.74 (0.97)   0.79 (1.23)

Table 4. Comparing AUC and NSS (in parentheses) of the Gist model of
Siagian et al [10] and HOG features for saccade prediction using kNN
and regression classifiers for the 3D Driving School and Super Mario
games. Dimensionality of the Gist vector is 714 and dimensionality of
HOG is 4800. Only for REG (HOG), the dimensionality of HOG is reduced
to retain 95% of its variance, which preserved about 900 dimensions
for DS and 500 for the Mario game.

In the third experiment, we aimed to compare the power of HOG
features [30] and the Gist features of Siagian et al [10]. The notion
behind using HOG features is that they encode rich structural
information from the entire scene and have been very successful in
object detection. Table 4 shows the performance of kNN and regression
classifiers over the DS and SM games. HOG features were better
descriptors of the scene and conveyed more information regarding
saccade locations over both games and using both classifiers.
However, computing the 8 orientation channels of HOG makes it about 2
times slower than the gist of [10], which uses 4, so here we
performed our experiments using the latter. HOG also generates
high-dimensional feature vectors, which makes them hard to store and
work with.

Figure 6 shows sample frames of video games with corresponding
saliency maps from models. Predicted maps by our models show dense
activity at task-relevant locations, thereby narrowing attention and
leading to higher NSS and AUC scores. These maps change per frame, as
opposed to the static AFM and Gaussian models.

4.  Discussion  and  Future  Work 

We proposed a unified Bayesian approach that is applicable to a large
class of everyday tasks where global scene knowledge, the sequence of
fixated locations, and actions constrain future eye fixations. In
addition to the above-mentioned factors, there might be other general
features influencing task-driven attention. Our framework allows easy
incorporation of those features for saccade prediction.

An important application of our model is quantitative analysis of
differences among populations of subjects (e.g., young vs. elderly or
novices vs. experts) in complex tasks such as driving. It can also be
useful for assistive technologies for demanding tasks, human-computer
interaction, context-aware systems, and health care.

Although the employed features convey information regarding the next
saccade, it is still possible to gain higher performance by knowing
more about the scene, for instance by calculating the number or state
of task-related objects. Such an approach, however, has the drawback
that for each task the relevant variables and the interactions among
them must be defined, which limits its generalization. We are now
investigating the role of local context ($P(O|L)$ in (1)) in the
modulation of top-down attention. Instead of predicting fixation
locations, it may be more efficient to bias the visual system toward
features of a relevant object within a global context. While the
exact fixated location at nearly the same gist may change based on
the recent history of saccades and actions, looking for a given
object rather than a given location may exhibit stronger invariance.
Also, extracting subjective factors such as fatigue, preference, and
experience and adding them to our model would be an interesting next
step.

Figure 6. Sample frames of video games and corresponding predicted
maps of models. Columns: Frame, BU map (Itti et al.), AFM, REG
(Gist), kNN (Gist), KDE (Gist), REG (All), kNN (All), KDE (All). The
red diamond indicates the human fixation and the blue circles mark
the maximum point of each map. A smaller distance hence means better
prediction. Currently we don't have action data for the TG game.

State-of-the-art in Visual Attention Modeling

Ali  Borji,  Member,  IEEE,  and  Laurent  Itti,  Member,  IEEE 

Abstract — Modeling  visual  attention  —  particularly  stimulus-driven,  saliency-based  attention  —  has  been  a  very  active  research 
area  over  the  past  25  years.  Many  different  models  of  attention  are  now  available,  which  aside  from  lending  theoretical 
contributions  to  other  fields,  have  demonstrated  successful  applications  in  computer  vision,  mobile  robotics,  and  cognitive 
systems.  Here  we  review,  from  a  computational  perspective,  the  basic  concepts  of  attention  implemented  in  these  models. 
We  present  a  taxonomy  of  nearly  65  models,  which  provides  a  critical  comparison  of  approaches,  their  capabilities,  and 
shortcomings.  In  particular,  thirteen  criteria  derived  from  behavioral  and  computational  studies  are  formulated  for  qualitative 
comparison  of  attention  models.  Furthermore,  we  address  several  challenging  issues  with  models,  including  biological  plausibility 
of  the  computations,  correlation  with  eye  movement  datasets,  bottom-up  and  top-down  dissociation,  and  constructing  meaningful 
performance  measures.  Finally,  we  highlight  current  research  trends  in  attention  modeling  and  provide  insights  for  future. 

Index  Terms — Visual  attention,  bottom-up  attention,  top-down  attention,  saliency,  eye  movements,  regions  of  interest,  gaze 
control,  scene  interpretation,  visual  search,  gist. 



1  Introduction 

A RICH stream of visual data (10^8 - 10^9 bits) enters
our eyes every second [1][2]. Processing this data in
real-time  is  an  extremely  daunting  task  without  the  help 
of  clever  mechanisms  to  reduce  the  amount  of  erroneous 
visual  data.  High-level  cognitive  and  complex  processes 
such  as  object  recognition  or  scene  interpretation  rely  on 
data  that  has  been  transformed  in  such  a  way  to  be  tractable. 
The  mechanism  this  paper  will  discuss  is  referred  to  as 
visual  attention  -  and  at  its  core  lies  an  idea  of  a  selection 
mechanism  and  a  notion  of  relevance.  In  humans,  attention 
is  facilitated  by  a  retina  that  has  evolved  a  high-resolution 
central  fovea  and  a  low-resolution  periphery.  While  visual 
attention  guides  this  anatomical  structure  to  important  parts 
of  the  scene  to  gather  more  detailed  information,  the  main 
question  is  on  the  computational  mechanisms  underlying 
this  guidance. 

In  recent  decades,  many  facets  of  science  have  been 
aimed  towards  answering  this  question.  Psychologists  have 
studied  behavioral  correlates  of  visual  attention  such  as 
change  blindness  [3]  [4],  inattentional  blindness  [5],  and 
attentional  blink  [6].  Neurophysiologists  have  shown  how 
neurons  accommodate  themselves  to  better  represent  objects 
of  interest  [27]  [28].  Computational  neuroscientists  have  built 
realistic  neural  network  models  to  simulate  and  explain 
attentional behaviors (e.g., [29][30]). Inspired by these studies,
roboticists and computer vision scientists have tried to
tackle the inherent problem of computational complexity to
build systems capable of working in real-time (e.g., [14][15]).
Although there are many models available now in the
research areas mentioned above, here we limit ourselves to
models that can compute saliency maps (please see next
section for definitions) from any image or video input. For
a review on computational models of visual attention in
general, including biased competition [10], selective tuning
[15], normalization models of attention [181], and many
others, please refer to [8]. Reviews of attention models from
psychological, neurobiological, and computational perspectives
can be found in [9][77][10][12][202][204][224]. Fig. 1
shows a taxonomy of attentional studies and highlights our
scope in this review.

Fig. 1. Taxonomy of visual attention studies. Ellipses with
solid borders illustrate our scope in this paper.

1.1  Definitions 

While  the  terms  attention,  saliency,  and  gaze  are  often 
used  interchangeably,  each  has  a  more  subtle  definition  that 
allows  their  delineation. 

Attention is a general concept covering all factors that
influence selection mechanisms, whether they be scene-driven
bottom-up (BU) or expectation-driven top-down (TD).

Saliency intuitively characterizes some parts of a scene
(which could be objects or regions) that appear to an
observer to stand out relative to their neighboring parts. The
term "salient" is often considered in the context of bottom-up
computations [18][14].

Gaze, a coordinated motion of the eyes and head, has
often been used as a proxy for attention in natural behavior
(see [99]). For instance, a human or a robot has to interact
with surrounding objects and control the gaze to perform a
task while moving in the environment. In this sense, gaze
control engages vision, action, and attention simultaneously
to perform sensorimotor coordination necessary for the
required behavior (e.g., reaching and grasping).

Fig. 2. Neuromorphic Vision C++ Toolkit (iNVT) developed at
iLab, USC, http://ilab.usc.edu/toolkit/. A saccade is targeted
to the location that is different from its surroundings in several
features. In this frame from a video, attention is strongly
driven by motion saliency.

1.2  Origins 

The basis of many attention models dates back to Treisman
& Gelade's [81] "Feature Integration Theory", where
they stated which visual features are important and how
they are combined to direct human attention over pop-out
and conjunction search tasks. Koch and Ullman [18] then
proposed a feed-forward model to combine these features
and introduced the concept of a saliency map, a
topographic map that represents the conspicuousness of scene
locations. They also introduced a winner-take-all neural
network that selects the most salient location and employs
an inhibition of return mechanism to allow the focus of
attention to shift to the next most salient location. Several
systems were then created implementing related models
which could process digital images [15] [16] [17]. The first
complete implementation and verification of the Koch &
Ullman model was proposed by Itti et al. [14] (see Fig. 2)
and was applied to synthetic as well as natural scenes. Since
then, there has been increasing interest in the field. Various
approaches with different assumptions for attention modeling
have been proposed and have been evaluated against
different datasets. In the following sections, we present a
unified conceptual framework in which we describe the
advantages and disadvantages of each model against one
another. We give the reader insight into the current state of
the art in attention modeling and identify open problems
and issues still facing researchers.
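
The selection loop described above (winner-take-all followed by inhibition of return) can be illustrated with a minimal sketch. It is not the Koch and Ullman neural network itself: the function name `scan_saliency_map`, the square inhibition neighbourhood, and the toy map are assumptions made only for illustration.

```python
def scan_saliency_map(saliency, n_shifts=3, ior_radius=1):
    """Return successive attended (row, col) locations.

    Winner-take-all picks the current maximum; inhibition of return then
    zeroes a square neighbourhood around it so the next maximum can win.
    """
    smap = [row[:] for row in saliency]      # work on a copy
    rows, cols = len(smap), len(smap[0])
    visited = []
    for _ in range(n_shifts):
        # Winner-take-all: location with the largest saliency value.
        r, c = max(
            ((i, j) for i in range(rows) for j in range(cols)),
            key=lambda rc: smap[rc[0]][rc[1]],
        )
        visited.append((r, c))
        # Inhibition of return: suppress the winner and its neighbours.
        for i in range(max(0, r - ior_radius), min(rows, r + ior_radius + 1)):
            for j in range(max(0, c - ior_radius), min(cols, c + ior_radius + 1)):
                smap[i][j] = 0.0
    return visited


# Hypothetical 5 x 5 saliency map with three conspicuous locations.
toy_map = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.3, 0.0, 0.0],
    [0.1, 0.3, 0.1, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.4],
    [0.5, 0.0, 0.0, 0.0, 0.0],
]
scanpath = scan_saliency_map(toy_map, n_shifts=3)
```

On this toy map the focus of attention visits the global maximum first and then moves to the strongest location that has not been inhibited.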

The main concerns in modeling attention are how, when,
and why we select behaviorally-relevant image regions.
Due to these factors, several definitions and computational
perspectives are available. A general approach is to take
inspiration from the anatomy and functionality of the early
human visual system, which is highly evolved to solve these
problems (e.g., [14] [15] [16] [191]). Alternatively, some studies
have hypothesized what function visual attention may serve
and have formulated it in a computational framework.
For instance, it has been claimed that visual attention is
attracted to the most informative [144], the most surprising
scene regions [145], or those regions that maximize reward
regarding a task [109].

1.3  Empirical  Foundations 

Attentional models have commonly been validated against
eye movements of human observers. Eye movements convey
important information regarding cognitive processes
such as reading, visual search, and scene perception. As
such, they often are treated as a proxy for shifts of attention.
For instance, in scene perception and visual search, when
the stimulus is more cluttered, fixations become longer and
saccades become shorter [19]. The difficulty of the task
(e.g., reading for comprehension versus reading for gist, or
searching for a person in a scene versus looking at the scene
for a memory test) obviously influences eye movement
behavior [19]. Although both attention and eye movement
prediction models are often validated against eye data, there
are slight differences in scope, approaches, stimuli, and level
of detail. Models for eye movement prediction (saccade
programming) try to understand mathematical and theoretical
underpinnings of attention. Some examples include search
processes (e.g., optimal search theory [20]), information
maximization models [21], Mr. Chips, an ideal-observer
model of reading [25], the EMMA (Eye Movements and
Movement of Attention) model [139], an HMM model for
controlling eye movements [26], and a constrained random
walk model [175]. To that end, they usually use simple controlled
stimuli, while on the other hand, attention models utilize
a combination of heuristics, cognitive and neural evidence,
and tools from machine learning and computer vision to
explain eye movements in both simple and complex scenes.
Attention models are also often concerned with practical
applicability. Reviewing all eye movement prediction models is
beyond the scope of this paper. The interested reader is
referred to [22] [23] [127] for eye movement studies and [24]
for a breadth-first survey of eye tracking applications.

Note that eye movements do not always tell the whole
story and there are other metrics which can be used for
model evaluation. For example, accuracy in correctly reporting
a change in an image (i.e., search-blindness [5]), or predicting
what attention-grabbing items one will remember,
show important aspects of attention which are missed by
sole analysis of eye movements. Many attention models in
visual search have also been tested by accurately estimating
reaction times (RT) (e.g., RT/setsize slopes in pop-out and
conjunction search tasks [224] [191]).

1.4  Applications 

In this paper, we focus on describing the attention models
themselves. There are, however, many technological applications
of these models which have been developed over the
years and which have further increased interest in attention
modeling. We organize the applications of attention modeling
into three categories: vision and graphics, robotics, and
those in other areas as shown in Fig. 3.

Category / Application: References

Vision and graphics
  Image segmentation: Mishra and Aloimonos, 2009; Maki et al., 2000
  Image quality assessment: Ma and Zhang, 2008; Ninassi et al., 2007
  Image matching: Walther et al., 2006; Siagian and Itti, 2009; Frintrop and Jensfelt, 2008
  Image rendering: DeCarlo and Santella, 2002
  Image and video compression: Ouerhani et al., 2003; Itti, 2004; Guo and Zhang, 2010
  Image thumbnailing: Marchesotti et al., 2009; Le Meur et al., 2006; Suh et al., 2003
  Image super-resolution: Jacobson et al., 2010
  Image re-targeting [thumbnailing]: Setlur et al., 2005; Chamaret et al., 2008; Goferman et al., 2010; Achanta et al., 2009; Marchesotti et al., 2009; Le Meur et al., 2006; Suh et al., 2003
  Image superresolution: Sadaka and Karam, 2009
  Video summarization: Marat et al., 2007; Ma et al., 2005
  Scene classification: Siagian and Itti, 2009
  Object detection: Frintrop, 2006; Navalpakkam and Itti, 2006; Fritz et al., 2005; Butko and Movellan, 2009; Viola and Jones, 2004; Ehinger et al., 2009
  Salient object detection: Liu et al., 2007; Goferman et al., 2010; Achanta et al., 2009; Rosin, 2009
  Object recognition: Salah et al., 2002; Walther et al., 2006 and 2007; Frintrop, 2006; Mitri et al., 2005; Gao and Vasconcelos, 2004 and 2009; Han and Vasconcelos, 2010; Paletta et al., 2005
  Visual tracking: Mahadevan and Vasconcelos, 2009; Frintrop, 2010
  Dynamic lighting: Seif El-Nasr, 2009
  Video shot detection: Boccignone et al., 2005
  Interest point detection: Kadir and Brady, 2001; Kienzle et al., 2007
  Automatic collage creation: Goferman et al., 2010; Wang et al., 2006
  Face segmentation and tracking: Li and Ngan, 2008

Robotics
  Active vision: Mertsching et al., 1999; Vijayakumar et al., 2001; Dankers, 2007; Borji et al., 2010
  Robot localization: Siagian and Itti, 2009; Ouerhani et al., 2005
  Robot navigation: Baluja and Pomerleau, 1997; Scheier and Egner, 1997
  Human-robot interaction: Breazeal, 1999; Heidemann et al., 2004; Belardinelli, 2008; Nagai, 2009; Muhl, 2007
  Synthetic vision for simulated actors: Courty and Marchand, 2003

Others
  Advertising: Rosenholtz et al., 2011; Liu et al., 2008
  Finding tumors in mammograms: Hong and Brady, 2003
  Retinal prostheses: Parikh et al., 2010

Fig. 3. Some applications of visual attention modeling.

1.5 Statement and Organization

Attention is difficult to define formally in a way that is
universally agreed upon. However, from a computational
standpoint, many models of visual attention (at least those
tested against the first few seconds of eye movements in
free-viewing) can be unified under the following general
problem statement. Assume $K$ subjects have viewed a set of $N$
images $\mathcal{I} = \{I_i\}_{i=1}^{N}$. Let
$L_i^k = \{p_{ij}^k, t_{ij}^k\}_{j=1}^{n_i^k}$ be the vector of
eye fixations (saccades) $p_{ij}^k = (x_{ij}^k, y_{ij}^k)$ and their
corresponding occurrence times $t_{ij}^k$ for the $k$-th subject over
image $I_i$. Let the number of fixations of this subject over the
$i$-th image be $n_i^k$. The goal of attention modeling is to find a
function (stimuli-saliency mapping) $f \in F$ which minimizes the
error on eye fixation prediction, i.e.,
$\sum_{k=1}^{K} \sum_{i=1}^{N} m(f(I_i), L_i^k)$, where
$m \in M$ is a distance measure (defined in section 2.7).
An  important  point  here  is  that  the  above  definition  better 
suits  bottom-up  models  of  overt  visual  attention,  and  may 
not  necessarily  cover  some  other  aspects  of  visual  attention 
(e.g.,  covert  attention  or  top-down  factors)  that  cannot  be 
explained  by  eye  movements. 
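
As a rough illustration of the problem statement above, the sketch below sums a distance measure between a model's saliency map and each subject's fixations over all subjects and images. Everything concrete in it is an assumption for illustration: the `missed_saliency` measure is a hypothetical stand-in (the actual measures are those of section 2.7), the toy `normalized_intensity` model is not a published model, and fixations are stored as `(x, y, t)` tuples.

```python
def total_prediction_error(model, images, fixations, distance):
    """Sum distance(model(image_i), fixations[k][i]) over subjects k and images i."""
    error = 0.0
    for subject_fixations in fixations:                    # k = 1..K
        for image, fix in zip(images, subject_fixations):  # i = 1..N
            error += distance(model(image), fix)
    return error


def missed_saliency(saliency_map, fix):
    """Hypothetical distance: 1 minus the mean saliency at fixated pixels."""
    if not fix:
        return 0.0
    hits = [saliency_map[y][x] for (x, y, _t) in fix]
    return 1.0 - sum(hits) / len(hits)


def normalized_intensity(image):
    """Toy 'model': saliency is pixel intensity rescaled to [0, 1]."""
    peak = max(max(row) for row in image) or 1.0
    return [[v / peak for v in row] for row in image]


images = [[[0, 2], [4, 8]]]               # N = 1 image of 2 x 2 pixels
fixations = [
    [[(1, 1, 0.25)]],                     # subject 1: one fixation on the brightest pixel
    [[(0, 0, 0.30), (1, 1, 0.55)]],       # subject 2: darkest pixel, then brightest
]
error = total_prediction_error(
    normalized_intensity, images, fixations, missed_saliency
)
```

Comparing models then amounts to comparing this total for different choices of the model function under the same distance measure.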

Here  we  present  a  systematic  review  of  major  attention 
models  that  we  apply  to  arbitrary  images.  In  section  2,  we 
first  introduce  several  factors  to  categorize  these  models. 
In  section  3,  we  then  summarize  and  classify  attention 
models  according  to  these  factors.  Limitations  and  issues 
in  attention  modeling  are  then  discussed  in  section  4  and 
are  followed  by  conclusions  in  section  5. 


2  Categorization  Factors 

We start by introducing 13 factors ($f_{1..13}$) that will be used
later for categorization of attention models. These factors
have their roots in behavioral and computational studies
of attention. Some factors describe models ($f_{1,2,3}$, $f_{8..11}$);
others ($f_{4..7}$, $f_{12,13}$) are not directly related, but are just as
important as they determine the scope of applicability of
different models.


2.1  Bottom-up  vs.  Top-down  Models 

A major distinction among models is whether they rely on
bottom-up influences ($f_1$), top-down influences ($f_2$), or a
combination of both.

Bottom-up  cues  are  mainly  based  on  characteristics  of  a 
visual  scene  (stimulus-driven) [75],  whereas  top-down  cues 
(goal-driven)  are  determined  by  cognitive  phenomena  like 
knowledge,  expectations,  reward,  and  current  goals. 

Regions  of  interest  that  attract  our  attention  in  a  bottom- 
up  manner  must  be  sufficiently  distinctive  with  respect  to 
surrounding  features.  This  attentional  mechanism  is  also 
called  exogenous,  automatic,  reflexive,  or  peripherally  cued 
[78].  Bottom-up  attention  is  fast,  involuntary,  and  most 
likely  feed-forward.  A  prototypical  example  of  bottom-up 
attention  is  looking  at  a  scene  with  only  one  horizontal  bar 
among  several  vertical  bars  where  attention  is  immediately 
drawn  to  the  horizontal  bar  [81].  While  many  models  fall 
in  this  category,  they  can  only  explain  a  small  fraction  of 
eye  movements  since  the  majority  of  fixations  are  driven 
by  task  [177]. 

On  the  other  hand,  top-down  attention  is  slow,  task- 
driven,  voluntary,  and  closed-loop  [77].  One  of  the  most 
famous  examples  of  top-down  attention  guidance  is  from 
Yarbus  in  1967  [79],  who  showed  that  eye  movements 
depend  on  the  current  task  with  the  following  experiment: 
subjects  were  asked  to  watch  the  same  scene  (a  room  with 
a  family  and  an  unexpected  visitor  entering  the  room) 
under  different  conditions  (questions)  such  as  "estimate  the 
material  circumstances  of  the  family",  "what  are  the  ages 
of  the  people?",  or  simply  to  freely  examine  the  scene.  Eye 
movements  differed  considerably  for  each  of  these  cases. 

Models have explored three major sources of top-down
influences in response to the question: how do we decide
where to look? Some models address visual search, in which
attention is drawn toward features of a target object we are
looking for. Other models investigate the role of scene
context or gist in constraining the locations that we look at. In
some cases, it is hard to say precisely where or what we
are looking at since a complex task governs eye fixations,
for example in driving. While in principle task demands on
attention subsume the other two factors, in practice models
have focused on each of them separately. Scene layout
has also been proposed as a source of top-down attention
[80] [93] and is here considered together with scene context.

1) Object Features. There is a considerable amount of
evidence for target-driven attentional guidance in real-world
search tasks [84] [85] [23] [83]. In classical search tasks,
target features are a ubiquitous source of attention guidance
[81] [82] [83]. Consider a search over simple search arrays in
which the target is a red item: attention is rapidly directed
toward the red item in the scene. Compare this with a more
complex target object, such as a pedestrian in a natural
scene, where although it is difficult to define the target, there
are still some features (e.g., upright form, round head, and
straight body) to direct visual attention [87].

The guided search theory [82] proposes that attention
can be biased toward targets of interest by modulating the
relative gains through which different features contribute to
attention. To return to our prior example, when looking for
a red object, a higher gain would be assigned to red color.
Navalpakkam et al. [51] derived the optimal integration of
cues (channels of the BU saliency model [14]) for detection
of a target in terms of maximizing the signal-to-noise ratio of
the target versus background. In [50], a weighting function
based on a measure of object uniqueness was applied to
each map before summing up the maps for locating an
object. Butko et al. [161] modeled object search based on the
same principles of visual search as stated by Najemnik et al.
[20] in a partially observable framework for face detection
and tracking, but they did not apply it to explain eye
fixations while searching for a face. Borji et al. [89] used
evolutionary algorithms to search in a space of basic saliency
model parameters for finding the target. Elazary and Itti [90]
proposed a model where top-down attention can tune both
the preferred feature (e.g., a particular hue) and the tuning
width of feature detectors, giving rise to more flexible
top-down modulation compared to simply adjusting the gains
of fixed feature detectors. Last but not least are studies such
as [147] [215] [141] that derive a measure of saliency from
formulating search for a target object.
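
The gain-modulation idea above can be sketched as a weighted sum of feature maps whose weights favour channels that separate the target from the background. This is a simplified illustration, not the derivation in [51]: setting each gain to the channel's signal-to-noise ratio divided by the mean ratio is an assumption, and the function names, channel names, and response values are hypothetical.

```python
def snr_gains(target_response, distractor_response):
    """Top-down gain per feature channel, proportional to its signal-to-noise ratio.

    Gains are normalised to average 1 so the overall map scale is unchanged.
    """
    snr = {
        name: target_response[name] / distractor_response[name]
        for name in target_response
    }
    mean_snr = sum(snr.values()) / len(snr)
    return {name: value / mean_snr for name, value in snr.items()}


def biased_saliency(feature_maps, gains):
    """Weighted sum of per-channel maps (all maps must share one shape)."""
    names = list(feature_maps)
    rows = len(feature_maps[names[0]])
    cols = len(feature_maps[names[0]][0])
    return [
        [sum(gains[n] * feature_maps[n][i][j] for n in names) for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical mean channel responses to a red target and to green distractors.
gains = snr_gains(
    target_response={"red": 0.8, "green": 0.1, "vertical": 0.5},
    distractor_response={"red": 0.2, "green": 0.8, "vertical": 0.5},
)
feature_maps = {
    "red": [[0.0, 1.0]],        # only the right item is red
    "green": [[1.0, 0.0]],      # only the left item is green
    "vertical": [[1.0, 1.0]],   # both items are vertical
}
saliency = biased_saliency(feature_maps, gains)
```

With these values the red channel receives the largest gain, so the red item ends up more salient than the green one even though both drive the orientation channel equally.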

The aforementioned studies on the role of object features
in visual search are closely related to object detection methods
in computer vision. Some object detection approaches
(e.g., the Deformable Part Model by Felzenszwalb et al. [206]
and the Attentional Cascade of Viola and Jones [220]) have
high detection accuracy for several objects such as cars,
persons, and faces. In contrast to cognitive models, such
approaches are often purely computational. Research on
how these two areas are related will likely benefit both.

2) Scene Context. Following a brief presentation of an
image (~80 ms or less), an observer is able to report
essential characteristics of a scene [176] [71]. This very rough
representation of a scene, the so-called "gist", does not contain
many details about individual objects but can provide
sufficient information for coarse scene discrimination (e.g.,
indoor vs. outdoor). It is important to note that gist does
not necessarily reveal the semantic category of a scene.
Chun and Jiang [91] have shown that targets appearing in
repeated configurations relative to some background
(distractor) objects were detected more quickly [71]. Semantic
associations among objects in a scene (e.g., a computer is
often placed on top of a desk) or contextual cues have also
been shown to play a significant role in the guidance of eye
movements [199] [84].

Several models for gist utilizing different types of low-level
features have been presented. Oliva and Torralba [93]
computed the magnitude spectrum of a Windowed Fourier
Transform over non-overlapping windows in an image.
They then applied principal component analysis (PCA) and
independent component analysis (ICA) to reduce feature
dimensions. Renninger and Malik [94] applied Gabor filters
to an input image and then extracted 100 universal textons
selected from a training set using K-means clustering. Their
gist vector was a histogram of these universal textons.
Siagian and Itti [95] used biological center-surround features
from orientation, color, and intensity channels for modeling
gist. Torralba [92] used wavelet decomposition tuned to
6 orientations and 4 scales. To extract gist, a vector is
computed by averaging each filter output over a 4 x 4 grid.
Similarly, he applied PCA to the resultant 384D vectors
(6 x 4 filters times 16 grid cells) to derive an 80D gist vector.
For a comparison of gist models, please refer to [96] [95].
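
The grid-averaging step described above can be sketched as follows. The sketch only shows how 6 orientations, 4 scales, and a 4 x 4 grid yield a 384-dimensional vector; the filter responses are constant placeholder maps rather than real wavelet outputs, the PCA reduction to 80 dimensions is omitted, and the function names are hypothetical.

```python
def grid_average(response, grid=4):
    """Average a 2-D filter response over a grid x grid layout of equal cells.

    Assumes the map height and width are divisible by `grid`.
    """
    rows, cols = len(response), len(response[0])
    cell_h, cell_w = rows // grid, cols // grid
    averages = []
    for gi in range(grid):
        for gj in range(grid):
            cell = [
                response[i][j]
                for i in range(gi * cell_h, (gi + 1) * cell_h)
                for j in range(gj * cell_w, (gj + 1) * cell_w)
            ]
            averages.append(sum(cell) / len(cell))
    return averages


def gist_vector(filter_responses, grid=4):
    """Concatenate the grid averages of every filter response map."""
    vector = []
    for response in filter_responses:
        vector.extend(grid_average(response, grid))
    return vector


n_orientations, n_scales, size = 6, 4, 8
# Placeholder responses: one constant 8 x 8 map per (orientation, scale) filter.
responses = [
    [[float(k)] * size for _ in range(size)]
    for k in range(n_orientations * n_scales)
]
gist = gist_vector(responses)   # 24 filters x 16 cells = 384 values
```

In a full pipeline the placeholder maps would be replaced by actual filter outputs, and a dimensionality reduction step would follow.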

Gist representations have become increasingly popular in
computer vision since they provide rich global yet discriminative
information useful for many applications such as
search in the large-scale scene datasets available today [116],
limiting the search to locations likely to contain an object of
interest [92] [87], scene completion [205], and modeling
top-down attention [101] [218]. Research in this area is
therefore promising.

3) Task Demands. Task has a strong influence on the
deployment of attention [79]. It has been claimed that visual
scenes are interpreted in a need-based manner to serve
task demands [97]. Hayhoe et al. [99] showed that there
is a strong relationship between visual cognition and eye
movements when dealing with complex tasks. Subjects
performing a visually-guided task were found to direct a
majority of fixations toward task-relevant locations [99]. It
is often possible to infer the algorithm a subject has in mind
from the pattern of her eye movements. For example, in
a "block-copying" task where subjects had to replicate an
assemblage of elementary building blocks, the observers'
algorithm for completing the task was revealed by patterns
of eye movements. Subjects first selected a target block in
the model to verify the block's position, then fixated the
workspace to place the new block in the corresponding
location [216]. Other research has studied high-level
accounts of gaze behavior in natural environments for tasks
such as sandwich making, driving, playing cricket, and
walking (see Henderson and Hollingworth [177], Rensink
[178], Land and Hayhoe [135], and Bailenson and Yee [179]).
Sodhi et al. [180] studied how distractions while driving,
such as adjusting the radio or answering a phone, affect eye
movements.

The  prevailing  view  is  that  bottom-up  and  top-down 
attention  are  combined  to  direct  our  attentional  behavior. 
An  integration  method  should  be  able  to  explain  when  and 
how  to  attend  to  a  top-down  visual  item  or  skip  it  for  the 
sake  of  a  bottom-up  salient  cue.  Recently,  [13]  proposed 
a  Bayesian  approach  that  explains  the  optimal  integration 
of  reward  as  a  top-down  attentional  cue,  and  contrast  or 
orientation  as  a  bottom-up  cue  in  humans.  Navalpakkam 
and  Itti  [80]  proposed  a  cognitive  model  for  task-driven 
attention  constrained  by  the  assumption  that  the  algorithm 
for  solving  the  task  was  already  available.  Peters  and  Itti 
[101]  learned  a  top-down  mapping  from  scene  gist  to  eye 
fixations  in  video  game  playing.  Integration  was  simply 
formulated  as  multiplication  of  BU  and  TD  components. 
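
The multiplicative integration mentioned above can be sketched as a pointwise product of a bottom-up map and a top-down map. Rescaling the product so that it sums to 1 is an assumption added here for readability, and the map values are made up for illustration.

```python
def combine_bu_td(bu_map, td_map):
    """Pointwise product of bottom-up and top-down maps, rescaled to sum to 1."""
    product = [
        [b * t for b, t in zip(bu_row, td_row)]
        for bu_row, td_row in zip(bu_map, td_map)
    ]
    total = sum(sum(row) for row in product)
    if total == 0:
        return product          # nothing survives the product; leave as zeros
    return [[v / total for v in row] for row in product]


bu = [[0.9, 0.1], [0.4, 0.4]]   # hypothetical bottom-up saliency
td = [[0.0, 0.5], [1.0, 0.2]]   # hypothetical top-down task relevance
combined = combine_bu_td(bu, td)
```

Because the combination is a product, the most salient bottom-up location (top left) is suppressed entirely when the top-down map assigns it zero relevance, and the peak moves to a location supported by both maps.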

2.2  Spatial  vs.  Spatio-temporal  Models 

In the real world, we are faced with visual information
that constantly changes due to egocentric movements or
dynamics of the world. Visual selection is then dependent
on both current scene saliency as well as the accumulated
knowledge from previous time points. Therefore, an attention
model should be able to capture scene regions that are
important in a spatio-temporal manner.

As detailed in section 3, almost all attention models
include a spatial component. We can distinguish between
two ways of incorporating temporal information in
saliency modeling: 1) Some bottom-up models use the
motion channel to capture human fixations drawn to
moving stimuli [119]. More recently, several researchers
have started modeling temporal effects on bottom-up
saliency (e.g., [143] [104] [105]). 2) On the other hand, some
models [109] [218] [26] [25] [102] aim to capture the
spatio-temporal aspects of a task, for example by learning
sequences of attended objects or actions as the task progresses.
For instance, the Attention Gate Model (AGM) [183]
emphasizes the temporal response properties of attention and
quantitatively describes the order and timing for humans
attending to sequential target stimuli. Previous information
about images, eye fixations, image content at fixations,
physical actions, as well as other sensory stimuli (e.g., auditory)
can be exploited to predict the next eye movement. Adding
a temporal dimension and the realism of natural interactive
tasks brings a number of complications in predicting gaze
targets within a computational model.

Suitable environments for modeling temporal aspects of
visual attention are dynamic and interactive setups such as
movies and games. Boiman and Irani [122] proposed an
approach for irregularity detection from videos by comparing
texture patches of actions with a learned dataset of irregular
actions. Temporal information was limited to the stimulus
level and did not include higher cognitive functions such as
the sequence of items processed at the focus of attention or
actions performed while playing the games. Some methods
derive static and dynamic saliency maps and propose methods
to fuse them (e.g., Jia Li et al. [133] and Marat et al. [49]).
In [103], a spatio-temporal attention modeling approach for
videos is presented by combining motion contrast derived
from the homography between two images and spatial
contrast calculated from color histograms. Virtual reality
(VR) environments have also been used in [99] [109] [97].
Some other models dealing with the temporal dimension
are [105] [108] [103]. We postpone the explanation of these
approaches to section 3.

Factor $f_3$ indicates whether a model uses spatial only or
spatio-temporal information for saliency estimation.

2.3  Overt  vs.  Covert  attention 

Attention can be differentiated based on its attribute as
"overt" versus "covert". Overt attention is the process of
directing the fovea towards a stimulus, while covert attention
is mentally focusing on one of several possible sensory
stimuli. An example of covert attention is staring at a person
who is talking but being aware of visual space outside the
central foveal vision. Another example is driving, where
a driver keeps his eyes on the road while simultaneously
covertly monitoring the status of signs and lights. The
current belief is that covert attention is a mechanism for
quickly scanning the field of view for an interesting location.
This covert shift is linked to eye movement circuitry that
sets up a saccade to that location (overt attention) [203].
However, this does not completely explain complex
interactions between covert and overt attention. For instance,
it is possible to attend to the right-hand corner of the field of
view and actively suppress eye movements to that location.
Most models detect regions that attract eye fixations,
and few explain overt orientation of the eyes along with head
movements. The lack of computational frameworks for covert
attention might be because the behavioral mechanisms and
functions of covert attention are still unknown. Further, it
is not known yet how to measure covert attention.

Because of a great deal of overlap between overt and
covert attention and since they are not exclusive concepts,
saliency models could be considered as modeling both overt
and covert mechanisms. However, an in-depth discussion of
this topic is beyond the scope of this paper and merits
separate treatment elsewhere.


2.4  Space-based  vs.  Object-based  Models 

There is no unique agreement on the unit of attentional
scale: Do we attend to spatial locations, to features, or to
objects? The majority of psychophysical and neurobiological
studies are about space-based attention (e.g., Posner's
spatial cueing paradigm [98] [111]). There is also strong
evidence for feature-based attention (detecting an odd item
in one feature dimension [81] or tuning curve adjustments
of feature selective neurons [7]) and object-based attention
(selectively attending to one of two objects, e.g., face vs.
vase illusion [112] [113] [84]). The current belief is that these
theories are not mutually exclusive and visual attention
can be deployed to each of these candidate units, implying
there is no single unit of attention. Humans are capable
of attending to multiple (between four and five) regions of
interest simultaneously [114] [115].

In the context of modeling, a majority of models are
space-based (see Fig. 7). It is also viable to think that humans
work and reason with objects (compared with rough pixel
values) as main building blocks of top-down attentional
effects [84]. Some object-based attentional models have
previously been proposed, but they lack explanation for eye
fixations (e.g., Sun and Fisher [117], Borji et al. [88]). This
shortcoming makes verification of their plausibility difficult.
For example, the limitation of the Sun and Fisher [117]
model is the use of human segmentation of the images;
it employs information that may not be available in the
pre-attentive stage (before the objects in the image are
recognized). Availability of object-tagged image and video
datasets (e.g., LabelMe Image and Video [116] [188]) has
made conducting effective research in this direction possible.
The link between object-based and space-based models
remains to be addressed in the future. Feature-based models
(e.g., [51] [83]) adjust properties of some feature detectors
in an attempt to make a target object more salient in a
distracting background. Because of the close relationship
between visual features and objects, in this paper we
categorize feature-based models under object-based models as
shown in Fig. 7.

The ninth factor, $f_9$, indicates whether a model is
space-based or object-based, meaning that it needs to work with
objects instead of raw spatial locations.

2.5  Features 

Traditionally, according to feature integration theory (FIT)
and behavioral studies [81] [82] [118], three features have
been used in computational models of attention: intensity
(or intensity contrast, or luminance contrast), color, and
orientation. Intensity is usually implemented as the average
of three color channels (e.g., [14] [117]) and processed
by center-surround processes inspired by neural responses
in lateral geniculate nucleus (LGN) [10] and V1 cortex.
Color is implemented as red-green and blue-yellow channels
inspired by color-opponent neurons in V1 cortex, or
alternatively by using other color spaces such as HSV
[50] or Lab [160]. Orientation is often implemented as a
convolution with oriented Gabor filters or by the application
of oriented masks. Motion was first used in [119] and was
implemented by applying directional masks to the image
(in the primate brain motion is derived by the neurons at
MT and MST regions which are selective to direction of
motion). Some studies have also added specific features for
directing attention like skin hue [120], face [167], horizontal
line [93], wavelet [133], gist [92] [93], center-bias [123],
curvature [124], spatial resolution [125], optical flow [15] [126],
flicker [119], multiple superimposed orientations (crosses or
corners) [127], entropy [129], ellipses [128], symmetry [136],
texture contrast [131], above average saliency [131], depth
[130], and local center-surround contrast [189]. While most
models have used the features proposed by FIT [81], some
approaches have incorporated other features like Difference
of Gaussians (DOG) [144] [141] and features derived from
natural scenes by ICA and PCA algorithms [92] [142]. For
target search, some have employed the structural description
of objects such as the histogram of local orientations
[87] [199]. For detailed information regarding important
features in visual search and direction of attention, please refer
to [118] [81] [82]. Factor $f_{10}$ categorizes models based on
the features they use.
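
The intensity and color-opponent channels described above can be sketched for a single RGB pixel. Intensity as the channel mean follows the text; the broadly tuned red, green, blue, and yellow formulas are an assumption based on a commonly used formulation and are not spelled out in this section. The center-surround function assumes the coarse (surround) map has already been resampled to the size of the fine (center) map.

```python
def intensity(r, g, b):
    """Intensity as the mean of the three colour channels."""
    return (r + g + b) / 3.0


def opponent_channels(r, g, b):
    """Red-green and blue-yellow opponency from broadly tuned colour channels."""
    red = r - (g + b) / 2.0
    green = g - (r + b) / 2.0
    blue = b - (r + g) / 2.0
    yellow = (r + g) / 2.0 - abs(r - g) / 2.0 - b
    # Negative tuned responses are clamped to zero before taking differences.
    red, green, blue, yellow = (max(0.0, v) for v in (red, green, blue, yellow))
    return red - green, blue - yellow


def center_surround(center_map, surround_map):
    """Absolute pointwise difference between a fine and a coarse map."""
    return [
        [abs(c - s) for c, s in zip(c_row, s_row)]
        for c_row, s_row in zip(center_map, surround_map)
    ]


rg, by = opponent_channels(1.0, 0.0, 0.0)                   # pure red pixel
contrast = center_surround([[1.0, 0.2]], [[0.4, 0.4]])      # toy 1 x 2 maps
```

A pure red pixel drives the red-green channel fully and leaves the blue-yellow channel at zero, which is the kind of opponent response the text refers to.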

2.6  Stimuli  and  Task  Type 

Visual stimuli can first be distinguished as being either
static (e.g., search arrays, still photographs; factor $f_4$) or
dynamic (e.g., videos, games; factor $f_5$). Video games are
interactive and highly dynamic since they do not generate
the same stimuli each run and have nearly natural renderings,
though they still lag behind the statistics of natural
scenes and do not have the same noise distribution. The
setups here are more complex, more controversial, and more
computationally intensive. They also engage a large number
of cognitive behaviors.

The second distinction is between synthetic stimuli (Gabor
patches, search arrays, cartoons, virtual environments,
games; factor f6) and natural stimuli (or approximations
thereof, including photographs and videos of natural scenes;
factor f7). Since humans live in a dynamic world, video
and interactive environments provide a more faithful
representation of the task facing the visual system than static
images. Another interesting domain for studying attentional
behavior, agents in virtual reality setups, can be seen in the
work of Sprague and Ballard [109], who employed a realistic
human agent in VR and used reinforcement learning (RL) to
coordinate action selection and visual perception in a sidewalk
navigation task involving avoiding obstacles, staying
on the sidewalk, and collecting litter.

Factor f8 distinguishes among task types. The three most
widely explored tasks to date in the context of attention
modeling are: (1) free-viewing tasks, in which subjects are
supposed to freely watch the stimuli (there is no task or
question here, but many internal cognitive tasks are usually
engaged), (2) visual search tasks, in which subjects are asked
to find an odd item or a specific object in a natural scene,
and (3) interactive tasks. In many real-world situations,
tasks such as driving and playing soccer engage subjects
tremendously. These complex tasks involve many subtasks
such as visual search, object tracking, and focused and
divided attention.

2.7  Evaluation  Measures 

Given a model that outputs a saliency map S, we need to
evaluate it quantitatively by comparing it with eye
movement data (or click positions) G. How can the two be
compared? We can think of them as probability distributions and
use Kullback-Leibler (KL) or Percentile metrics to measure the
distance between distributions. Or we can consider S as
a binary classifier and use signal detection theory analysis
(the Area Under the ROC Curve (AUC) metric) to assess the
performance of this classifier. We can also think of S and
G as random variables and use the Correlation Coefficient (CC)
or Normalized Scanpath Saliency (NSS) to measure their
statistical relationship. Another way is to think of G as
a sequence of eye fixations (scanpath) and compare this
sequence with the sequence of fixations chosen by a saliency
model (string-edit distance).

While in principle any model might be evaluated using
any measure, in Fig. 7 we list in factor f12 the measures
which were used by the authors of each model. In the rest,
by Estimated Saliency Map (ESM, S) we mean
the saliency map of a model, and by Ground-truth Saliency
Map (GSM, G) we mean a map that is built by combining
recorded eye fixations from all subjects or by combining
salient regions tagged by human subjects for each image.

From another perspective, evaluation measures for attention
modeling can be classified into three categories: 1)
point-based, 2) region-based, and 3) subjective evaluation.
In point-based measures, salient points from ESMs are compared
to GSMs made by combining eye fixations. Region-based
measures are useful for evaluating attention models
over regional saliency datasets by comparing the ESMs and
labeled salient regions (GSM annotated by human subjects)
[133]. In [103], subjective scores on estimated saliency maps
were reported on three levels: "Good", "Acceptable", and
"Failed". The problem with such subjective evaluation is
that it is difficult to extend it to large-scale datasets.

In the following, we focus on explaining those metrics
with more consensus in the literature and provide pointers
to others (Percentile [134] and Fixation Saliency Method
(FS) [131] [182]) for reference.

Kullback-Leibler (KL) Divergence. The KL divergence
is usually used to measure the distance between two probability
distributions. In the context of saliency, it is used
to measure the distance between the distributions of saliency
values at human vs. random eye positions [145] [77]. Let
$i = 1, \ldots, N$ index the $N$ human saccades in the experimental
session. For a saliency model, the ESM is sampled (or averaged
in a small vicinity) at the human saccade location $x_{i,\mathrm{human}}$ and at
a random point $x_{i,\mathrm{random}}$. The saliency magnitude at the
sampled locations is then normalized to the range [0,1]. The
histogram of these values in $q$ bins covering the range [0,1]
across all saccades is then calculated. $H_k$ and $R_k$ are the
fractions of points in bin $k$ for human (salient) and random points,
respectively. Finally, the difference between these histograms is
measured with the (symmetric) KL divergence (a.k.a. relative entropy):

$$\mathrm{KL} = \frac{1}{2} \sum_{k=1}^{q} \left( H_k \log\frac{H_k}{R_k} + R_k \log\frac{R_k}{H_k} \right) \qquad (1)$$

Models that can better predict human fixations exhibit
higher KL divergence, since observers typically gaze towards
a minority of regions with the highest model responses
while avoiding the majority of regions with low
model responses. Advantages of KL divergence over other
scoring schemes [212] [131] are: 1) other measures essentially
calculate the rightward shift of the $H_k$ histogram relative
to the $R_k$ histogram, whereas KL is sensitive to any
difference between the histograms, and 2) KL is invariant
to reparameterizations, such that applying any continuous
monotonic nonlinearity to the ESM values
S does not affect scoring. One disadvantage of the KL
divergence is that it does not have a well-defined upper
bound: as the two histograms become completely
non-overlapping, the KL divergence approaches infinity.
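
The histogram comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the cited work: the function names, the default of 10 bins, and the small constant `eps` (added so that empty bins do not produce an infinite score) are all assumptions of this sketch.

```python
import math

def histogram(values, q):
    """Fraction of values falling in each of q equal-width bins on [0, 1]."""
    counts = [0] * q
    for v in values:
        k = min(int(v * q), q - 1)  # a value of exactly 1.0 goes to the last bin
        counts[k] += 1
    n = len(values)
    return [c / n for c in counts]

def symmetric_kl(human_vals, random_vals, q=10, eps=1e-12):
    """Symmetric KL divergence between the human and random histograms.

    human_vals / random_vals: saliency values already normalized to [0, 1],
    sampled at human and at random locations.
    eps is an assumption of this sketch; it keeps log() finite for empty bins.
    """
    H = histogram(human_vals, q)
    R = histogram(random_vals, q)
    total = 0.0
    for h, r in zip(H, R):
        h, r = h + eps, r + eps
        total += h * math.log(h / r) + r * math.log(r / h)
    return 0.5 * total
```

With identical samples the score is zero, and it grows as the human histogram moves away from the random one in either direction.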

Normalized Scanpath Saliency (NSS). The normalized
scanpath saliency [134] [131] is defined as the response value
at the human eye position, $(x_h, y_h)$, in a model's ESM
that has been normalized to have zero mean and unit
standard deviation: $\mathrm{NSS} = \frac{1}{\sigma_S}\left(S(x_h, y_h) - \mu_S\right)$,
where $\mu_S$ and $\sigma_S$ are the mean and standard deviation of the ESM. Similar to
the percentile measure, NSS is computed once for each
saccade, and subsequently the mean and standard error
are computed across the set of NSS scores. NSS = 1
indicates that the subjects' eye positions fall in a region
whose predicted density is one standard deviation above
average. Meanwhile, NSS $\le$ 0 indicates that the model
performs no better than picking a random position on the
map. Unlike KL and percentile, NSS is not invariant to
reparameterizations. Please see [134] for an illustration of
NSS calculation.
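
As a rough illustration of the definition above, the following Python sketch computes the mean NSS over a set of fixations. The representation of the map as a list of rows and of fixations as `(x, y)` pixel coordinates is an assumption made for this example, as is the function name.

```python
import math

def nss(saliency_map, fixations):
    """Mean NSS over fixations.

    saliency_map: list of rows (saliency_map[y][x]); fixations: list of (x, y).
    Both conventions are assumptions of this sketch.
    """
    values = [v for row in saliency_map for v in row]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0:
        raise ValueError("constant saliency map: NSS is undefined")
    # z-score of the map at each fixated location
    scores = [(saliency_map[y][x] - mu) / sigma for x, y in fixations]
    return sum(scores) / len(scores)
```

A fixation on the single bright pixel of an otherwise dark map gives a positive score, while a fixation on the dark background gives a negative one.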

Area Under Curve (AUC). AUC is the area under the Receiver
Operating Characteristic (ROC) [195] curve. As the
most popular measure in the community, ROC is used for
the evaluation of a binary classifier system with a variable
threshold (usually used to classify between two methods
like saliency vs. random). Using this measure, the model's
ESM is treated as a binary classifier on every pixel in the
image; pixels with saliency values larger than a threshold
are classified as fixated while the rest of the pixels are
classified as non-fixated [144] [167]. Human fixations are
then used as ground truth. By varying the threshold, the
ROC curve is drawn as the false positive rate vs. true positive
rate, and the area under this curve indicates how well the
saliency map predicts actual human eye fixations. Perfect
prediction corresponds to a score of 1, while a score of 0.5
indicates chance level. This measure has the
desired characteristic of transformation invariance, in that the
area under the ROC curve does not change when applying
any monotonically increasing function to the saliency measure.
Please see [192] for an illustration of ROC calculation.
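
The following Python sketch illustrates the AUC score on a small map. Instead of sweeping a threshold explicitly, it uses the equivalent pairwise (Mann-Whitney) formulation: the area under the ROC curve equals the probability that a fixated pixel has a higher saliency value than a non-fixated one, counting ties as one half. The data layout and function name are assumptions of this example, and the quadratic loop is meant for clarity, not speed.

```python
def auc(saliency_map, fixations):
    """Area under the ROC curve for a saliency map against fixated pixels.

    saliency_map: list of rows (saliency_map[y][x]); fixations: list of (x, y).
    """
    fixated = set(fixations)
    pos, neg = [], []
    for y, row in enumerate(saliency_map):
        for x, v in enumerate(row):
            (pos if (x, y) in fixated else neg).append(v)
    if not pos or not neg:
        raise ValueError("need at least one fixated and one non-fixated pixel")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))
```

Because only the ordering of saliency values enters the computation, applying a monotonically increasing function to the map leaves the score unchanged, which is the invariance property noted above.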

Linear Correlation Coefficient (CC). This measure is
widely used to compare the relationship between two
images for applications such as image registration, object
recognition, and disparity measurement [196] [197]. The linear
correlation coefficient measures the strength of a linear
relationship between two variables:

$$\mathrm{CC}(G,S) = \frac{\frac{1}{n}\sum_{x,y} \left(G(x,y) - \mu_G\right)\left(S(x,y) - \mu_S\right)}{\sqrt{\sigma_G^2 \, \sigma_S^2}} \qquad (2)$$

where G and S represent the GSM (fixation map, a map
with 1's at fixation locations, usually convolved with a
Gaussian) and the ESM, respectively, $\mu$ and $\sigma^2$ are the mean
and the variance of the values in these maps, and $n$ is the
number of pixels. An interesting
advantage of CC is the capacity to compare two variables
by providing a single scalar value between -1 and +1. When
the correlation is close to +1 or -1, there is almost a perfectly
linear relationship between the two variables.
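
A minimal Python sketch of the correlation coefficient between a ground-truth map and an estimated map is given below. It assumes both maps are lists of rows of equal size and uses the population variance (dividing by the number of pixels); the function name is chosen for this example only.

```python
import math

def correlation_coefficient(G, S):
    """Linear (Pearson) correlation between two equally sized 2-D maps."""
    g = [v for row in G for v in row]
    s = [v for row in S for v in row]
    if len(g) != len(s):
        raise ValueError("maps must have the same number of pixels")
    n = len(g)
    mu_g, mu_s = sum(g) / n, sum(s) / n
    cov = sum((a - mu_g) * (b - mu_s) for a, b in zip(g, s)) / n
    var_g = sum((a - mu_g) ** 2 for a in g) / n
    var_s = sum((b - mu_s) ** 2 for b in s) / n
    if var_g == 0 or var_s == 0:
        raise ValueError("constant map: correlation is undefined")
    return cov / math.sqrt(var_g * var_s)
```

Any positive affine rescaling of the estimated map leaves the score at +1 when it matches the ground truth, whereas a general monotonic nonlinearity would change it.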

String Editing Distance. To compare the regions of interest
selected by a saliency model (mROI) to human regions
of interest (hROI) using this measure, saliency maps and
human eye movements are first clustered into regions.
The ROIs are then ordered by the value assigned by the saliency
algorithm or by the temporal ordering of human fixations in the
scanpath. The results are strings of ordered points such
as $string_h$ = "abcfeffgdc" and $string_s$ = "afbffdcdf".
The string editing index $S_s$ is then defined by
an optimization algorithm with unit cost assigned to the
three different operations: deletion, insertion, and substitution.
Finally, the sequential similarity between the two strings
is defined as similarity = $1 - S_s / |string_s|$. For the example
strings above, the similarity is 1 - 6/9 ≈ 0.33 (see [198] [127]
for more information on string editing distance). Please
see [127] for an illustration of this score.
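
The string editing score can be illustrated with the standard dynamic-programming (Levenshtein) algorithm, sketched below in Python. The function names are chosen for this example, and the two test strings "abcfeffgdc" and "afbffdcdf" are an assumed reading of the example strings in the text; normalizing by the length of the model string follows the 6/9 figure quoted there.

```python
def edit_distance(a, b):
    """Minimum number of unit-cost deletions, insertions and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

def scanpath_similarity(string_h, string_s):
    """Sequential similarity: 1 - (edit distance) / (length of model string)."""
    if not string_s:
        raise ValueError("model string must be non-empty")
    return 1.0 - edit_distance(string_h, string_s) / len(string_s)
```

For the two strings above the edit distance is 6, so the similarity is 1 - 6/9, roughly 0.33.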

2.8  Datasets 

There are several eye movement datasets of still images (for
studying static attention) and videos (for studying dynamic
attention). In Fig. 7 we list as factor f13 some available
datasets. Here we only mention those datasets that are
mainly used for evaluation and comparison of attention
models, though there are many other works that have
gathered special-purpose data (e.g., for driving, sandwich
making, and block copying [135]).

Figs. 4 and 5 show summaries of image and video eye
movement datasets (for a few, labeled salient regions are
available). Researchers have also used mouse tracking to
estimate attention. Although this type of data is noisier,
some early results show a reasonably good ground-truth
approximation. For instance, Scheier and Egner [61] showed
that mouse movement patterns are close to eye-tracking
patterns. A web-based mouse tracking application was set
up at the TCTS laboratory [110]. Other potentially useful
datasets (which are not eye-movement datasets) are tagged-object
datasets like PASCAL and Video LabelMe. Some
attentional works have used this type of data (e.g., [116]).

3  Attention  Models 

In this section, models are explained based on their mechanism
for obtaining saliency. Some models fall into more than one
category. In the rest of this review, we focus only on those
models which have been implemented in software and
can process arbitrary digital images and return corresponding
saliency maps. Models are introduced in chronological
order. Note that here we are more interested in models
of saliency than in those approaches that detect and
segment the most salient region or object in a scene. While
these models use a saliency operator at the initial stage, their
main goal is not to explain attentional behavior. However,
some methods have further inspired subsequent saliency
models. Here, we reserve the term "saliency detection" to
refer to such approaches.

3.1  Cognitive  Models  (C) 

Almost  all  attentional  models  are  directly  or  indirectly 
inspired  by  cognitive  concepts.  The  ones  that  have  more 
bindings  to  psychological  or  neurophysiological  findings 
are  described  in  this  section. 

Kienzle et al. [165] | 14 | 200 | 1024 x 768 | 60 | 3
8-bit grayscale stimuli presented on a 19-inch Iiyama CRT at full screen size, corresponding to 37° x 27° of visual angle.

Einhauser et al. [84] | 7 | 54 | 640 x 480 | 50
Overall 32,225 fixations, with an average fixation duration of 370 ± 293 ms and 11.9 fixations per image. The average distance between subsequent fixation points on the screen is 127 pixels [19°]. The authors restricted their analysis to 76° x 55° regions, which account for 92% [29,725] of all fixations. Stimuli were presented using an NEC LT157 projector at a resolution of 1024 x 768 at 60 Hz and on average spanned 133 x 100 cm, corresponding to 37° x 27° of visual angle.

Ouerhani et al. [210] | 6 | 640 x 480 | 70 | 5
Age range 24-34, with normal or corrected-to-normal acuity as well as normal color vision. Stimuli were presented on a 19" monitor subtending 29° x 22°. Task was "just look at the image". Eye tracker: EyeLink, SensoMotoric Instruments GmbH. Recording at 250 Hz, with 0.5°-1° accuracy and a 3 x 3 point grid calibration sequence.

Bruce and Tsotsos [144] | 20 | 120 | 681 x 511 | 75 | 4
Images [indoor and outdoor] were presented at random with a 2 s gray mask in between on a 21-inch CRT monitor. The eye tracking apparatus consisted of an ERICA workstation including a Hitachi CCD camera with an IR emitting LED. Stimuli were color images and the task was free viewing. Link: www-sop.inria.fr/members/Neil.Bruce

Stark and Choi [211] | 7 | 15 | 40 | 4
Bright Purkinje reflection captured by a video camera. Stimulus size was 15 x 20 cm, yielding 21° x 29° with 0.5-1 degree accuracy. Images were terrain photographs, landscapes and paintings. Task was free viewing.

Chikkerur et al. [154] | 8 | 220 | 640 x 480 | 70 | 5
Scenes contained cars [4.6 ± 3.8] and pedestrians [2.1 ± 2.2]; visual angle: 16 x 12. Subjects were asked to count the number of cars or pedestrians. An ETL 400 ISCAN table-mounted, video-based eye tracker was used at 240 Hz with an accuracy of 0.5°. Age range: 18-35. Images were 100 from x and 120 from LabelMe. Link: http://www.sharat.org/

Torralba et al. [92] | 24 | 36 | 15.8 x 11.9
In the people search task, 14 stimuli out of 36 contained no people and 22 included 1-6 people. The same set of [36 indoor] images was used for painting search [17 images without any paintings and the rest with 1-6 paintings] and for mug search [half without and half with 1-6 mugs]. Eye tracking was performed by a Generation 5.5 SRI Dual Purkinje Image Eyetracker, sampling at 1000 Hz. Color photos were displayed on a NEC Multisync P750 monitor [143 Hz refresh]. Mean target size was 1.05% [1.24%] of the image size for people, 7.3% [7.6%] for paintings and 0.5% [0.4%] for mugs. Link: http://people.csail.mit.edu/torralba/GlobalFeaturesAndAttention/

Judd et al. [166] | 15 | 1003 | Various | 48 | 3
Images were collected from Flickr Creative Commons and LabelMe datasets. The longest dimension was 1024, with the other ranging from 405 to 1024. There were 779 landscape images and 228 portrait images. Images were freely viewed with a 1 sec gray screen between each two. The camera was recalibrated after every 50 images. The first fixation was discarded. Age range: 18-35. Link: http://people.csail.mit.edu/tjudd/WherePeopleLook/index.html

Cerf et al. [167] | 7 | 250 | 1024 x 768 | 80
Eye positions of subjects were acquired at 1000 Hz using an EyeLink 1000 [SR Research, Osgoode, Canada]. The task had three phases: 1) free viewing, 2) searching for a face, an object, banana, cell phone, toy car, etc. shown by a probe image, and 3) a 100-image recognition memory task where subjects had to answer with y/n whether they had seen the image before. Stimuli subtended 28° x 21° of visual angle. Link: http://www.fifadb.com/

Peters et al. [134] | 12 | 100/class | 75
An ISCAN Inc. eye tracker was used to sample eye movements at 120 Hz. Age range: 18-25; four did free viewing over [outdoor photos, overhead satellite imagery, and fractals]. Another four did free viewing over stimuli involving Gabor snakes and Gabor arrays. Seven subjects did a contour detection task. Resolution was 1000 x 1000 to 1536 x 1024, subtending a visual angle of 15.8° x 15.8° to 16.2° x 25°. Link: http://ilab.usc.edu

Reinagel and Zador [212] | 5 | 77 | 640 x 480 | 79 | 10
Images were 69 nature scenes, 38 man-made objects such as buildings, 17 animals or humans and 8 synthetics. An RK-416 infrared Pupil Tracking System and a 21-inch monitor were used. The whole image subtended 28° x 21° of visual angle. Subjects were instructed to "Study the images". Estimated tracking error was 0.5°. Link: http://zadorlab.cshl.edu/

Hwang and Pomplun [87] | 30 | 160 | 1280 x 1024 | 10
Age range: 19-40. Stimuli were 160 photographs [1280 x 1024] of real-world scenes including landscapes, home interiors, and city scenes and covered 20° x 20° of visual angle. An SR Research EyeLink II system was used. Stimuli were presented on a 19-inch Dell P992 monitor [85 Hz refresh rate]; the whole image subtended 28° x 21°. Link: http://www.cs.umb.edu/~marc/

Kootstra et al. [136] | 31 | 99 | 1024 x 768 | 70
EyeLink head-mounted eye tracking [SR Research] was used and was recalibrated before each session. Age range: 17-32. Task was free viewing. Stimuli were: 12 Animals, 12 Automan, 16 Buildings, 20 flowers, 41 natural scenes and were shown on an 18-inch CRT monitor [36 x 27 cm]. Link: http://www.csc.kth.se/~kootstra/

Tatler [123] | 14 | 48 | 800 x 600 | 60
An EyeLink I eye tracker was used. Subjects had normal or corrected-to-normal vision, with age range 17-32. Images subtended 30° x 22° and were presented on a 17-inch SVGA color monitor [74 Hz refresh]. Task was free viewing. Link: http://www.activevisionlab.org/

Engmann et al. [182] | 8 | 90 | 1280 x 1024 | 85
Subjects had normal or corrected-to-normal vision and normal color vision, with age range 20-27 [avg: 22.3]. Stimuli were presented on a 19.7" EIZO FlexScan F77S CRT monitor [100 Hz refresh]. Natural scenes were selected from the Zurich natural image database [Einhauser et al. [99]], which only rarely contains isolated nameable objects or man-made artifacts, at resolution 2048 x 1536. Images subtended 26° x 18° [17-inch SVGA color monitor]. Task was free viewing. Eye tracker was an EyeLink 2000 [SR Research Ltd., Canada] with 13-point calibration.

Engelke et al. [213] | 30 | 7 | 512 x 512 | 60 | 8
Images were 4 human faces ["Barbara"], 1 "Goldhill" face [gorilla] and 1 "Peppers" image. Eye tracker was an EyeTech TM3 and the task was free viewing. Each image was presented for 8 sec with a gray screen with central fixation in between.

Le Meur et al. [41] | 40 | 46 | 800 x 600 | *
Stimuli were 46 degraded versions of 10 color images using spatial filtering. Task was free viewing. Eye tracker was made by Cambridge Research Corporation. Viewing distance was four times the TV monitor height. Link: http://www.irisa.fr/temics/staff/lemeur

Ehinger et al. [87] | 14 | 912 | 800 x 600 | 75 | 15
Stimuli were color images [half with a pedestrian] with resolution 800 x 600 and were shown on a 21-inch CRT monitor with resolution 1024 x 768 and refresh rate 100 Hz. A 240 Hz ISCAN RK-464 video-based eye tracker was used for recording. The task was to decide whether a pedestrian is in the scene or not. Link: http://cvcl.mit.edu/searchmodels/

Rajashekar et al. [174] | 29 | 101 | 1042 x 768 | 134
Subjects were 18 males and 11 females with a mean age of 27. Eye tracker was made by Image Systems Corp, MN. Grayscale images were shown on a 21-inch grayscale gamma-corrected monitor with resolution 1024 x 768. The task was free viewing. Link: http://live.ece.utexas.edu/research/doves/

Fig. 4. Some benchmark eye movement datasets over still images often used to evaluate visual attention models.

Itti et al.'s basic model [14] uses three feature channels:
color, intensity, and orientation. This model has been the
basis of later models and the standard benchmark for
comparison. It has been shown to correlate with human eye
movements in free-viewing tasks [131] [184]. An input image
is subsampled into a Gaussian pyramid, and each pyramid
level $\sigma$ is decomposed into channels for Red (R), Green (G),
Blue (B), Yellow (Y), Intensity (I), and local orientations
($O_\theta$). From these channels, center-surround "feature maps"
$f_l$ for different features $l$ are constructed and normalized. In
each channel, maps are summed across scale and normalized
again:

$$f_l = \mathcal{N}\left(\sum_{c=2}^{4} \sum_{s=c+3}^{c+4} f_{l,c,s}\right), \quad \forall l \in L_I \cup L_C \cup L_O$$

$$L_I = \{I\}, \quad L_C = \{RG, BY\}, \quad L_O = \{0^\circ, 45^\circ, 90^\circ, 135^\circ\} \qquad (3)$$

These maps are linearly summed and normalized once
more to yield the "conspicuity maps":

$$C_I = f_I, \quad C_C = \mathcal{N}\left(\sum_{l \in L_C} f_l\right), \quad C_O = \mathcal{N}\left(\sum_{l \in L_O} f_l\right)$$

Finally, the conspicuity maps are linearly combined once
more to generate the saliency map: $S = \frac{1}{3} \sum_{k \in \{I,C,O\}} C_k$.
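
The combination stage described above (feature maps summed within a channel, normalized, and then averaged into a saliency map) can be sketched in Python as follows. This is only an illustration under stated assumptions: maps are small lists of rows that are already at a common resolution, and `normalize` is a simplified stand-in for the normalization operator, rescaling a map to [0, 1] and weighting it by (1 - mean)^2 so that maps with a few strong peaks are promoted. The original operator is defined in terms of local maxima, which this sketch does not reproduce.

```python
def normalize(m):
    """Simplified stand-in for the map normalization operator (an assumption):
    rescale to [0, 1], then weight by (max - mean)^2 with max == 1."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    scaled = [[(v - lo) / (hi - lo) for v in row] for row in m]
    mean = sum(v for row in scaled for v in row) / len(flat)
    weight = (1.0 - mean) ** 2
    return [[weight * v for v in row] for row in scaled]

def add_maps(maps):
    """Element-wise sum of equally sized maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) for x in range(cols)] for y in range(rows)]

def saliency_map(f_intensity, f_color, f_orientation):
    """Combine per-feature maps into conspicuity maps and average them.

    f_intensity: one map; f_color: list of maps (e.g. RG, BY);
    f_orientation: list of maps (one per orientation).
    """
    c_i = f_intensity
    c_c = normalize(add_maps(f_color))
    c_o = normalize(add_maps(f_orientation))
    total = add_maps([c_i, c_c, c_o])
    return [[v / 3.0 for v in row] for row in total]
```

In this sketch a map with a single isolated peak keeps most of its amplitude after normalization, whereas a map that is active almost everywhere is strongly suppressed.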

There are at least four implementations of this model:
iNVT by Itti [14], Saliency Toolbox (STB) by Walther [35],
VOCUS by Frintrop [50], and a Matlab code by Harel
[121]. In [119], this model was extended to the video domain
by adding motion and flicker contrasts. Zhaoping Li [170]
introduced a neural implementation of the saliency map in area
V1 that can also account for search difficulty in pop-out
and conjunction search tasks.

Le Meur et al. [41] proposed an approach for bottom-up
saliency based on the structure of the human visual system
(HVS). Contrast sensitivity functions, perceptual decomposition,
visual masking, and center-surround interactions are
some of the features implemented in this model. Later, Le
Meur et al. [138] extended this model to the spatio-temporal
domain by fusing achromatic, chromatic, and temporal
information. In this new model, early visual features are
extracted from the visual input into several separate parallel
channels. A feature map is obtained for each channel, and
a unique saliency map is then built from the combination of
those channels. The major novelty proposed here lies in the
inclusion of the temporal dimension as well as the addition
of a coherent normalization scheme.

CRCNS-ORIG [145]
C: 50 clips [0:06-1:30 min each], ~25 min total, ~6 GB for 46K frames
S: 8 [3 female, 5 male] subjects with normal or corrected vision. Ages 23-32, from mixed ethnicities
T: "Follow main actors and actions, try to understand overall what happens in each clip."
ST: Complex video stimuli involving TV programs, outdoor scenes, video games. Outdoor day & night, parks, crowds, rooftop bar, etc.
D: ISCAN RK-464 eye tracker, 240 Hz recording, 9-point calibration after every 5 clips, 640 x 480 resolution at 60.27 Hz doublescan, 33.185 ms/movie frame, [x,y] of each saccade
L: http://crcns.org/data-sets/eye/eye-1

CRCNS-MTV [145]
C: 50 video clips [4-7 subjects on each video clip]
S: 8 subjects different from subjects of CRCNS
D: This dataset was created by cutting video clips of CRCNS into 1-3 s "clippets" and reassembling those clippets in random order. Other aspects were the same as the original dataset.
L: http://crcns.org/data-sets/eye/eye-1

Jia Li et al. [133]
C: 431 videos with total length of 7.5 hours, 764,806 frames in total with 62,356 key frames
S: 23 [17 male and 4 female] subjects with age range between 21-37
ST: 6 genres: documentary, ad, cartoon, news, movie and surveillance
D: 10-23 subjects per clip were assigned to manually label the salient regions with one or multiple rectangles from key frames. A drawback of this dataset is the rectangular labeling, but this may be resolved with segmentation; inefficiency to evaluate whatever
L: http://www.jdl.ac.cn/user/jiali/

Peters and Itti [101]
C: 24 game-play sessions, ~185 GB for 216K frames, 8,449 saccades of amplitude 2° or more
S: 5 [3 male, 2 female] subjects with normal or corrected vision
T: "Play 4 or 5 five-minute segments of the Nintendo GameCube games"
ST: Games include Mario Kart, Wave Race, Super Mario Sunshine, Hulk and Pac Man World.
D: Subjects were seated at a viewing distance of 80 cm [28° x 21° usable field of view]. Stimuli were presented on a 22" computer monitor [LaCie Corp; 640 x 480, 75 Hz refresh, mean screen luminance 30 cd/m2, room luminance 4 cd/m2]. ISCAN RK-464 eye tracker, 240 Hz recording, 9-point calibration before each game segment. Frames were grabbed using a dual-CPU Linux computer with SCHED_FIFO scheduling to ensure microsecond-accurate timing.
L: http://ilab.usc.edu/rjpeters/

Shic and Scassellati [74]
C: 2 clips
S: 10 young adults, normal and mildly mentally retarded
T: One-minute-long clips from the black-and-white movie "Who's Afraid of Virginia Woolf?"
D: A head-mounted eye tracker [ISCAN Inc.] was used. The eye tracker employs dark pupil-corneal reflection video-oculography and had accuracy within ±0.3° over a horizontal and vertical range of ±20°, with a sampling rate of 60 Hz. The subjects sat 63.5 cm from the 48.3 cm screen on which the movie was shown at a resolution of 640 x 480 pixels.
L: http://sites.google.com/site/fredshic/home

Marat et al. [49]
C: 53 short video clips [25 fps, 720 x 576 pixels], 1700 frames
S: 15 [3 f, 12 m] subjects with age range 23-40 and normal or corrected-to-normal vision
ST: Each clip ~1-3 sec long, 324 clip snippets. There was no particular task or question. TV shows, TV news, animated movies, commercials, sport and music. Indoor, outdoor, day-time, night-time
D: The clip snippets were strung together to form 20 clips of 30 seconds [30.20 ± 0.61]. Eye positions were recorded at 500 Hz [20 eye positions per frame for two eyes] using an EyeLink II [SR Research]. Participants were positioned with their chin supported in front of a 21" color monitor [75 Hz] at a viewing distance of 57 cm [40° x 30° usable field of view]. A calibration was carried out every five stimuli and a drift control was done before each stimulus.
L: http:/ / starti g.ovh.net/~qgsmabaq/ sophie/index.php

Le Meur et al. [138]
C: 7 clips [25 Hz, 352 x 288 pixels], 2451 frames. Each clip ~4.5-33.8 sec long
S: 17-27 subjects for different clips, with normal or corrected-to-normal vision
T: Free viewing
ST: Faces, sporting events, audiences, landscape, logos, incrustations, low and high spatiotemporal
D: Dual-Purkinje eye tracker from Cambridge Research Corporation. Sampling frequency was 50 Hz. CRT display 800 x 600 pixels, 25° x 27°. Distance to screen was 81 cm.
L: http://www.irisa.fr/temics/staff/lemeur

C: Clips; S: Subjects; T: Task; ST: Stimuli Type; D: Description; L: Link

Fig. 5. Some benchmark eye movement datasets over video stimuli for evaluating visual attention prediction.

Navalpakkam and Itti [51] modeled visual search as a top-down
gain optimization problem by maximizing the signal-to-noise
ratio (SNR) of the target vs. distractors instead
of learning explicit fusion functions. That is, they learned
linear weights for feature combination by maximizing the
ratio between target saliency and distractor saliency.


Kootstra et al. [136] developed three symmetry-saliency
operators and compared them with human eye tracking
data. Their method is based on the isotropic symmetry and
radial symmetry operators of Reisfeld et al. [137] and the
color symmetry of Heidemann [64]. Kootstra et al. extended
these operators to multi-scale symmetry-saliency models.
The authors showed that their model performs significantly
better on symmetric stimuli than the model of Itti et al. [14].

Marat et al. [104] proposed a bottom-up approach for
spatio-temporal saliency prediction in video stimuli. This
model extracts two signals from the video stream, corresponding
to the parvocellular and magnocellular cells of the
retina. From these signals, a static and a dynamic saliency
map are derived and fused into a spatio-temporal map.
Prediction results of this model were better for the first few
frames of each clip snippet.

Murray et al. [200] introduced a model based on a low-level
vision system in three steps: 1) visual stimuli are processed
according to what is known about the early human
visual pathway (color-opponent and luminance channels,
followed by a multi-scale decomposition), 2) a simulation
of the inhibition mechanisms present in cells of the visual
cortex normalizes their response to stimulus contrast, and 3)
information is integrated at multiple scales by performing
an inverse wavelet transform directly on weights computed
from the non-linearization of the cortical outputs.

Cognitive models have the advantage of expanding our
view of the biological underpinnings of visual attention. This
in turn helps us understand the computational principles and
neural mechanisms of this process, as well as of other complex
dependent processes such as object recognition.

3.2  Bayesian  Models  (B) 

Bayesian  modeling  is  used  for  combining  sensory  evidence 
with  prior  constraints.  In  these  models,  prior  knowledge 
(e.g.,  scene  context  or  gist)  and  sensory  information  (e.g., 
target  features)  are  probabilistically  combined  according  to 
Bayes'  rule  (e.g.,  to  detect  an  object  of  interest). 

Torralba [92] and Oliva et al. [140] proposed a Bayesian
framework for visual search tasks. Bottom-up saliency is derived
from their formulation as $1/p(f \mid f_G)$, where $f_G$ represents
a global feature that summarizes the probability density of
presence of the target object in the scene, based on analysis
of the scene gist. Following the same direction, Ehinger
et al. [87] linearly integrated three components (bottom-
up saliency, gist, and object features) to explain eye
movements made while looking for people in a database of
about 900 natural scenes.

Itti  and  Baldi  [145]  defined  surprising  stimuli  as  those 
which  significantly  change  beliefs  of  an  observer.  This 
is  modeled  in  a  Bayesian  framework  by  computing  the 
KL  divergence  between  posterior  and  prior  beliefs.  This 
notion  is  applied  both  over  space  (surprise  arises  when 
observing  image  features  at  one  visual  location  affects  the 
observer's  beliefs  derived  from  neighboring  locations)  and 
time  (surprise  then  arises  when  observing  image  features  at 
one  point  in  time  affects  beliefs  established  from  previous 
observations). 
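
The surprise computation described above can be sketched in a few lines. This is only an illustrative sketch with discrete beliefs over two hypothetical models; the function names `bayes_update` and `surprise` and the example likelihoods are assumptions made for illustration and are not taken from [145].

```python
import math

def bayes_update(prior, likelihood):
    """Posterior P(M|D) from prior P(M) and likelihood P(D|M) over discrete models."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def surprise(prior, posterior):
    """KL divergence KL(posterior || prior) in bits; zero-probability terms are skipped."""
    return sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)

prior = [0.5, 0.5]
# Data equally likely under both models: beliefs do not change.
expected = bayes_update(prior, [0.5, 0.5])
# Data strongly favouring model 0: beliefs shift.
unexpected = bayes_update(prior, [0.9, 0.1])

print(surprise(prior, expected))    # no belief change, so no surprise
print(surprise(prior, unexpected))  # belief change, so positive surprise
```

Data that leave the observer's beliefs unchanged carry no surprise, however rare they are; only data that move the posterior away from the prior do.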

Zhang et al. [141] proposed a definition of saliency, known
as SUN (Saliency Using Natural statistics), by considering
what the visual system is trying to optimize when directing
attention. The resulting model is a Bayesian framework in
which bottom-up saliency emerges naturally as the
self-information of visual features, and overall saliency
(incorporating top-down information with bottom-up saliency)
emerges as the point-wise mutual information between
local image features and the search target's features when
searching for a target. Since this model provides a general
framework for many models, we describe it in more detail.

SUN's  formula  for  bottom-up  saliency  is  similar  to  the 
work  of  Oliva  et  al.  [140],  Torralba  [92],  and  Bruce  and 
Tsotsos  [144],  in  that  they  are  all  based  on  the  notion  of 
self-information  (local  information).  However,  differences 
between  current  image  statistics  and  natural  statistics  lead 
to  radically  different  kinds  of  self-information.  Briefly,  the 
motivating factor for using self-information with the statistics
of the current image is that a foreground object is
likely to have features that are distinct from those of the
background. Since targets are observed less frequently than
background during an organism's lifetime, rare features are
more likely to indicate targets.

Let z denote a pixel in the image, C whether or not a
point belongs to a target class, and L the location of a point
(pixel coordinates). Also, let F be the visual features of a
point. Having these, the saliency of a point z is defined
as $s_z = P(C = 1 \mid F = f_z, L = l_z)$, where $f_z$ and $l_z$ are the
feature and location of z. Using Bayes' rule and assuming that
features and locations are independent and conditionally
independent given C = 1, the saliency of a point is:

$$\log s_z = -\log P(F = f_z) + \log P(F = f_z \mid C = 1) + \log P(C = 1 \mid L = l_z) \qquad (5)$$

The first term on the right-hand side is the self-information
(bottom-up saliency), which depends only on the visual
features observed at the point z. The second term on the
right  is  the  log-likelihood  which  favors  feature  values  that 
are  consistent  with  prior  knowledge  of  the  target  (e.g.,  if 
the  target  is  known  to  be  green  the  log-likelihood  will  take 
larger  values  for  a  green  point  than  for  a  blue  point).  The 
third  term  is  the  location  prior  which  captures  top-down 
knowledge  of  the  target's  location  and  is  independent  of 
visual  features  of  the  object.  For  example,  this  term  may 
capture  knowledge  about  some  target  being  often  found  in 
the  top-left  quadrant  of  an  image. 
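
As a small numerical sketch of the three terms in (5), the function below adds the self-information, the log-likelihood, and the log location prior. The function name `sun_log_saliency` and all probability values are hypothetical and chosen only to mirror the green-target example in the text.

```python
import math

def sun_log_saliency(p_feature, p_feature_given_target, p_target_given_location):
    """log s_z = -log P(F=f_z) + log P(F=f_z | C=1) + log P(C=1 | L=l_z)."""
    self_information = -math.log(p_feature)             # bottom-up term
    log_likelihood = math.log(p_feature_given_target)   # target appearance term
    log_location_prior = math.log(p_target_given_location)
    return self_information + log_likelihood + log_location_prior

# Hypothetical numbers: green and blue are equally common in natural images,
# but the target is known to be green.
green = sun_log_saliency(0.2, 0.9, 0.1)
blue = sun_log_saliency(0.2, 0.05, 0.1)
print(green > blue)  # the green point is more salient for this target

# With the top-down terms held equal, the rarer feature is more salient.
rare = sun_log_saliency(0.01, 0.5, 0.1)
common = sun_log_saliency(0.5, 0.5, 0.1)
print(rare > common)
```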

Zhang et al. [142] extended the SUN model to dynamic
scenes by introducing temporal filters (Difference of
Exponentials) and fitting a generalized Gaussian distribution to
the estimated distribution of each filter response. A bank of
spatio-temporal filters is first applied to each video frame;
then, for any video, the model calculates its features and
estimates the bottom-up saliency for each point. The filters
were designed to be both efficient and similar to the human
visual system. The probability distributions of these
spatio-temporal features were learned from a set of videos
from natural environments.

Jia Li et al. [133] presented a Bayesian multi-task learning
framework for visual attention in video. Bottom-up
saliency  modeled  by  multi-scale  wavelet  decomposition  was 
fused  with  different  top-down  components  trained  by  a 
multi-task  learning  algorithm.  The  goal  was  to  learn  task- 
related  "stimulus-to-saliency"  functions,  similar  to  [101]. 
This  model  also  learns  different  strategies  for  fusing  bottom- 
up  and  top-down  maps  to  obtain  the  final  attention  map. 

Boccignone  [55]  addressed  joint  segmentation  and  saliency 
computation  in  dynamic  scenes,  using  a  mixture  of  Dirichlet 
processes  as  a  basis  for  object-based  visual  attention.  He  also 
proposed  an  approach  for  partitioning  a  video  into  shots 
based on a foveated representation of a video.

A key benefit of Bayesian models is their ability to learn
from data and to unify many factors in a
principled manner. Bayesian models can, for example, take
advantage  of  the  statistics  of  natural  scenes  or  other  features 
that  attract  attention. 

3.3  Decision  Theoretic  Models  (D) 

The  decision-theoretic  interpretation  states  that  perceptual 
systems  evolve  to  produce  decisions  about  the  states  of 
the  surrounding  environment  that  are  optimal  in  a  decision 
theoretic  sense  (e.g.,  minimum  probability  of  error).  The 
overarching  point  is  that  visual  attention  should  be  driven 
by optimality with respect to the end task.

Gao and Vasconcelos [146] argued that for recognition,
salient features are those that best distinguish a class
of interest from all other visual classes. They then defined
top-down attention as classification with minimal
expected error. Specifically, given some set of features
$F = \{F_1, \ldots, F_d\}$, a location $l$, and a class label $C$ with
$C_l = 0$ corresponding to samples drawn from the surround region
and $C_l = 1$ corresponding to samples drawn from a smaller
central region centered at $l$, the judgment of saliency then
corresponds to a measure of mutual information, computed
as $I(F, C) = \sum_{i=1}^{d} I(F_i, C)$. They used DOG and Gabor filters,
measuring the saliency of a point as the KL divergence
between the histogram of filter responses at the point and
the histogram of filter responses in the surrounding region.
In [185], the same authors used this framework for bottom-
up saliency by combining it with center-surround image
processing. They also incorporated motion features (optical
flow) between pairs of consecutive images into their model
to account for dynamic stimuli. They adopted a dynamic
texture model using a Kalman filter in order to capture the
motion patterns in dynamic scenes.
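
The center-surround comparison of response histograms can be illustrated as follows. This is a simplified sketch rather than the discriminant saliency implementation of [146]: the bin count, the value range, the additive floor `eps`, and the toy filter responses are all assumptions made for illustration.

```python
import math

def histogram(values, n_bins, lo, hi, eps=1e-6):
    """Normalised histogram with a small additive floor so that no bin is exactly zero."""
    counts = [eps] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def center_surround_saliency(center, surround, n_bins=8, lo=0.0, hi=1.0):
    """KL divergence between center and surround filter-response histograms."""
    return kl_divergence(histogram(center, n_bins, lo, hi),
                         histogram(surround, n_bins, lo, hi))

# Hypothetical filter responses in [0, 1].
surround = [0.1] * 80
pop_out = center_surround_saliency([0.9] * 20, surround)   # center differs from surround
blend_in = center_surround_saliency([0.1] * 20, surround)  # center matches surround
print(pop_out > blend_in)
```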

Here we show that the Bayesian computation of (5) is a special
case of the decision theoretic model. Saliency computation
in the entire decision theoretic approach boils down to
calculating the target posterior probability $P(C = 1 \mid F = f_z)$
(the output of their simple cells [215]). By applying Bayes'
rule, we have:

$$P(C_l = 1 \mid F = f_z) = \sigma\left(\log \frac{P(F = f_z \mid C_l = 1)\, P(C_l = 1)}{P(F = f_z \mid C_l = 0)\, P(C_l = 0)}\right) \qquad (6)$$

where $\sigma(x) = (1 + e^{-x})^{-1}$ is the sigmoid function. The log
likelihood ratio inside the sigmoid can be trivially written
(using the independence assumptions of [141]) as:

$$-\log P(F = f_z \mid C = 0) + \log P(F = f_z \mid C = 1) + \log \frac{P(C = 1 \mid L = l_z)}{P(C = 0 \mid L = l_z)} \qquad (7)$$

which is the same as (5), up to an additive constant, under the
following assumptions: 1) $P(F = f_z \mid C = 0) = P(F = f_z)$ and
2) $P(C = 0 \mid L = l_z) = K$, for some constant K. Assumption 1 states that
the  feature  distribution  in  the  absence  of  the  target  is  the 
same  as  the  feature  distribution  for  the  set  of  natural  images. 
Since  the  overwhelming  majority  of  natural  images  do  not 
have  the  target,  this  is  really  not  much  of  an  assumption. 
The  two  distributions  are  virtually  identical.  Assumption  2 
simply  states  that  the  absence  of  the  target  is  equally  likely 
in  all  image  locations.  This,  again,  seems  like  a  very  mild 
assumption. 
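
To spell out the step, the two assumptions can be substituted into the log likelihood ratio of (7). The following is a worked sketch using only the symbols already defined for (5) and (7):

```latex
\begin{align*}
& -\log P(F = f_z \mid C = 0) + \log P(F = f_z \mid C = 1)
  + \log \frac{P(C = 1 \mid L = l_z)}{P(C = 0 \mid L = l_z)} \\
&\quad = -\log P(F = f_z) + \log P(F = f_z \mid C = 1)
  + \log P(C = 1 \mid L = l_z) - \log K \\
&\quad = \log s_z - \log K
\end{align*}
```

Since the sigmoid is monotonically increasing and $\log K$ does not depend on z, the decision theoretic posterior ranks image points in the same order as the SUN saliency $s_z$ of (5).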

Because of the above connections, both decision theoretic and
Bayesian approaches have a biologically plausible
implementation, which has been extensively discussed by
Vasconcelos and colleagues [223] [147] [215]. The Bayesian methods
can be mapped to a network with a layer of simple cells,
and the decision theoretic models to a network with a layer
of simple cells and a layer of complex cells. The simple cell layer
can in fact also implement the AIM [144] and Rosenholtz [191]
models in Section 3.4, Elazary and Itti [90], and probably
others. So, while these models have not been directly
derived from biology, they can be implemented as cognitive
models.

Gao and Vasconcelos [147] used the discriminant saliency
model for visual recognition and showed good performance
on the PASCAL 2006 dataset.


Mahadevan and Vasconcelos [105] presented an unsupervised
algorithm for spatio-temporal saliency based on
biological mechanisms of motion-based perceptual grouping.
It  is  an  extension  of  the  discriminant  saliency  model  [146]. 
Combining  center-surround  saliency  with  the  power  of 
dynamic  textures  made  their  model  applicable  to  highly 
dynamic  backgrounds  and  moving  cameras. 

In Gu et al. [148], an activation map was first computed
by extracting primary visual features and detecting meaningful
objects from the scene. An adaptable retinal filter
was applied to this map to generate "regions of interest"
(ROIs whose locations correspond to these activation peaks
and whose sizes were estimated by an iterative adjustment
algorithm). The focus of attention was moved serially over
the  detected  ROIs  by  a  decision  theoretic  mechanism.  The 
generated  sequence  of  eye  fixations  was  determined  from  a 
perceptual  benefit  function  based  on  perceptual  costs  and 
rewards,  while  the  time  distribution  of  different  ROIs  was 
estimated  by  memory  learning  and  decaying. 

Decision  theoretic  models  have  been  very  successful  in 
computer  vision  applications  such  as  classification  while 
achieving  high  accuracy  in  fixation  prediction. 

3.4  Information  Theoretic  Models  (I) 

These models are based on the premise that localized
saliency computation serves to maximize information sampled
from one's environment. They deal with selecting the
most informative parts of a scene and discarding the rest.

Rosenholtz [191] [193] designed a model of visual search
which could also be used for saliency prediction over an
image in free-viewing. First, features of each point, $p_i$,
are derived in an appropriate uniform feature space (e.g.,
uniform color space). Then, from the distribution of the
features, the mean, $\mu$, and covariance, $\Sigma$, of distractor features
are computed. The model then defines target saliency as
the Mahalanobis distance, $\Delta$, between the target feature
vector, $T$, and the mean of the distractor distribution, where
$\Delta^2 = (T - \mu)' \Sigma^{-1} (T - \mu)$. This model is similar to
[92][141][160] in the sense that it estimates $1/P(x)$ (rarity of
a feature or self-information) for each image location x. This
model also underlies a clutter measure of natural scenes
(same authors [189]). An online version of this model is
available at [194].
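
The Mahalanobis saliency above can be sketched with NumPy as follows. The function name `mahalanobis_saliency` and the two-dimensional toy features are assumptions for illustration; the sketch also assumes at least two feature dimensions and a non-singular distractor covariance.

```python
import numpy as np

def mahalanobis_saliency(target, distractors):
    """Mahalanobis distance between a target feature vector and the distractor distribution.

    distractors: (n_samples, n_features) array, n_features >= 2 (assumption of this sketch).
    """
    distractors = np.asarray(distractors, dtype=float)
    mu = distractors.mean(axis=0)               # mean of distractor features
    sigma = np.cov(distractors, rowvar=False)   # covariance of distractor features
    diff = np.asarray(target, dtype=float) - mu
    # Solve Sigma * y = diff instead of forming the explicit inverse.
    delta_sq = diff @ np.linalg.solve(sigma, diff)
    return float(np.sqrt(delta_sq))

# Hypothetical 2-D features (e.g., two color coordinates) of five distractors.
distractors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
typical = mahalanobis_saliency([0.5, 0.5], distractors)  # at the distractor mean
odd_one = mahalanobis_saliency([2.5, 0.5], distractors)  # far from the distractors
print(typical, odd_one)
```

A target lying at the distractor mean gets zero saliency, while the distance of an outlying target is measured in units of the distractor spread along the relevant direction.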

Bruce and Tsotsos [144] proposed the AIM model (Attention
based on Information Maximization), which uses
Shannon's self-information measure for calculating saliency
of image regions. Saliency of a local image region is the
information that region conveys relative to its surroundings.
Information of a visual feature X is $I(X) = -\log p(X)$,
which decreases as the likelihood of observing X
(i.e., $p(X)$) increases. To estimate $I(X)$, the probability
density function $p(X)$ must be estimated. Over RGB images,
considering a local patch of size M x N, X has the high
dimensionality of 3 x M x N. To make the estimation of
$p(X)$ feasible, they used ICA to reduce the dimensionality
of the problem to estimating 3 x M x N 1D probability
density functions. To find the bases of ICA, they used a
large sample of RGB patches drawn from natural scenes.
For a given image, the 1D pdf for each ICA basis vector
is first computed using non-parametric density estimation.
Then, at each image location, the probability of observing
the RGB values in a local image patch is the product of the
corresponding ICA basis likelihoods for that patch.
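
The last two steps (1D density estimates per basis, then a product of likelihoods) can be sketched as below. This is not the AIM implementation: the ICA projection is assumed to have been done already, a plain histogram stands in for the non-parametric density estimator, and the name `self_information` and the bin count are hypothetical.

```python
import numpy as np

def self_information(coeffs, n_bins=16):
    """Return -log p(X) per patch from independent 1D histogram likelihoods.

    coeffs: (n_patches, n_components) array of basis coefficients, assumed to be
    the (approximately independent) outputs of an ICA projection done elsewhere.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    info = np.zeros(coeffs.shape[0])
    for k in range(coeffs.shape[1]):
        column = coeffs[:, k]
        counts, edges = np.histogram(column, bins=n_bins)
        probs = counts / counts.sum()
        # Bin index of every coefficient; values on the last edge go to the last bin.
        idx = np.clip(np.digitize(column, edges[1:-1]), 0, n_bins - 1)
        # The log of a product of likelihoods is the sum of the log-likelihoods.
        info += -np.log(np.maximum(probs[idx], 1e-12))
    return info

# Hypothetical coefficients: 99 ordinary patches and one unusual patch (row 0).
coeffs = np.zeros((100, 2))
coeffs[0] = [5.0, 5.0]
info = self_information(coeffs)
print(int(info.argmax()))  # index of the most informative patch
```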

Hou and Zhang [151] introduced the Incremental Coding
Length (ICL) approach to measure the respective entropy
gain of each feature. The goal was to maximize the entropy
of the sample visual features. By selecting features with
large coding length increments, the computational system
can achieve attention selectivity in both dynamic and static
scenes. They proposed ICL as a principle by which energy
is distributed in the attention system. In this principle,
the salient visual cues correspond to unexpected features.
According to the definition of ICL, these features may
elicit entropy gain in the perception state and are therefore
assigned high energy.

Mancas [152] hypothesized that attention is attracted by
minority features in an image. The basic operation is to
count similar image areas by analyzing histograms, which
makes this approach closely related to Shannon's
self-information measure. Instead of comparing only isolated
pixels, it takes into account the spatial relationships of areas
surrounding each pixel (e.g., mean and variance). Two types
of rarity models are introduced: global and local. While
global rarity considers the uniqueness of features over the entire
image, some image details may still appear salient due
to local contrast or rarity. Similar to the center-surround
ideas of [14], a multi-scale approach is used for the
computation of local contrast.

Seo and Milanfar [108] proposed the Saliency Detection
by Self-Resemblance (SDSR) approach. First, the local image
structure at each pixel is represented by a matrix of local
descriptors (local regression kernels), which are robust in
the presence of noise and image distortions. Then, matrix
cosine similarity (a generalization of cosine similarity) is
employed to measure the resemblance of each pixel to its
surroundings. For each pixel, the resulting saliency map
represents the statistical likelihood of its feature matrix $F_i$
given the feature matrices $F_j$ of the surrounding pixels:


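$$s_i = \frac{1}{\sum_{j=1}^{N} \exp\left(\frac{-1 + \rho(F_i, F_j)}{\sigma^2}\right)} \qquad (8)$$

where $\rho(F_i, F_j)$ is the matrix cosine similarity between
two feature matrices $F_i$ and $F_j$, the sum runs over the N
surrounding pixels, and $\sigma$ is a local weighting
parameter. The columns of local feature matrices represent
the output of local steering kernels, which are modeled as: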


$$K(x_l - x_i) = \frac{\sqrt{\det(C_l)}}{h^2} \exp\left\{\frac{(x_l - x_i)^T C_l (x_l - x_i)}{-2h^2}\right\} \qquad (9)$$

where $l = 1, \ldots, P$, P is the number of pixels in a
local window, h is a global smoothing parameter, and the
matrix $C_l$ is a covariance matrix estimated from a collection
of spatial gradient vectors within the local analysis window
around a sampling position $x_l = [x_1, x_2]_l^T$.
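
The self-resemblance measure can be sketched as follows, taking saliency as the reciprocal of the summed terms exp((rho - 1) / sigma^2) over the surrounding feature matrices. This is a hedged sketch: the toy 2 x 2 matrices stand in for real local steering kernel features, and the default `sigma=0.1` is an arbitrary value chosen for illustration.

```python
import numpy as np

def matrix_cosine_similarity(a, b):
    """Frobenius inner product of a and b divided by the product of their Frobenius norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def self_resemblance_saliency(center, neighbours, sigma=0.1):
    """Saliency of a pixel: 1 / sum_j exp((rho(F_i, F_j) - 1) / sigma^2)."""
    total = sum(np.exp((matrix_cosine_similarity(center, f) - 1.0) / sigma ** 2)
                for f in neighbours)
    return float(1.0 / total)

# Toy feature matrices (hypothetical; real ones hold local steering kernel outputs).
f_a = np.array([[1.0, 0.0], [0.0, 1.0]])
f_b = np.array([[0.0, 1.0], [1.0, 0.0]])

resembles = self_resemblance_saliency(f_a, [f_a] * 4)  # identical surround
stands_out = self_resemblance_saliency(f_a, [f_b] * 4) # dissimilar surround
print(resembles, stands_out > resembles)
```

A pixel whose feature matrix matches all N neighbours gets the minimum saliency 1/N, and saliency grows rapidly as the similarities drop below one.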

Yin Li et al. [171] proposed a visual saliency model
based on conditional entropy for both image and video.
Saliency was defined as the minimum uncertainty of a
local region given its surrounding area (namely the minimum
conditional entropy), when perceptual distortion
is considered. They approximated the conditional entropy
by the lossy coding length of multivariate Gaussian data.
The final saliency map was accumulated by pixels and
further segmented to detect the proto-objects. Van et al. [186]
proposed a newer version of this model by adding a
multi-resolution scheme to it.

Wang et al. [201] introduced a model to simulate human
saccadic scanpaths on natural images by integrating
three related factors guiding eye movements sequentially:
1) reference sensory responses, 2) fovea-periphery resolution
discrepancy, and 3) visual working memory. They compute
three multi-band filter response maps for each eye movement,
which are then combined into multi-band residual
filter response maps. Finally, they compute residual perceptual
information (RPI) at each location. The next fixation is
selected as the location with the maximal RPI value.

3.5  Graphical  Models  (G) 

A graphical model is a probabilistic framework in which a
graph denotes the conditional independence structure between
random variables. Attention models in this category
treat eye movements as a time series. Since there are hidden
variables influencing the generation of eye movements,
approaches like Hidden Markov Models (HMM), Dynamic
Bayesian Networks (DBN), and Conditional Random Fields
(CRF) have been incorporated.

Salah  et  al.  [52]  proposed  an  approach  for  attention  and 
applied  it  to  handwritten  digit  and  face  recognition.  In  the 
first  step  (Attentive  level),  a  bottom-up  saliency  map  is 
constructed using simple features. In the intermediate level,
"what" and "where" information is extracted by dividing
the image space into uniform regions and training a single-layer
perceptron over each region in a supervised manner.
Eventually  this  information  is  combined  at  the  associative 
level  with  a  discrete  Observable  Markov  Model  (OMM). 
Regions  visited  by  a  fovea  are  treated  as  states  of  the  OMM. 
An  inhibition  of  return  allows  the  fovea  to  focus  on  the  other 
positions  in  the  image. 

Liu et al. [43] proposed a set of novel features and adopted
a Conditional Random Field to combine these features for
salient object detection on their regional saliency dataset.
Later, they extended this approach to detect salient object
sequences in videos [48]. They presented a supervised approach
for salient object detection, formulated as an image
segmentation problem using a set of local, regional, and
global salient object features. A CRF was trained and evaluated
on a large image database containing 20,000 images
labeled by multiple users.

Harel et al. [121] introduced Graph-Based Visual Saliency
(GBVS). They extract feature maps at multiple spatial scales.
A scale-space pyramid is first derived from image features:
intensity, color, and orientation (similar to Itti et al. [14]).
Then, a fully-connected graph over all grid locations of
each feature map is built. Weights between two nodes are
assigned proportional to the dissimilarity of their feature values
and to their spatial closeness. The dissimilarity between two
positions (i, j) and (p, q) in the feature map, with respective
feature values M(i, j) and M(p, q), is defined as:

$$d((i,j) \,\|\, (p,q)) = \left|\log \frac{M(i,j)}{M(p,q)}\right| \qquad (10)$$

The directed edge from node (i, j) to node (p, q) is then
assigned a weight proportional to their dissimilarity and
their closeness on lattice M:

$$w((i,j),(p,q)) = d((i,j) \,\|\, (p,q)) \cdot F(i - p, j - q), \quad \text{where } F(a, b) = \exp\left(-\frac{a^2 + b^2}{2\sigma^2}\right) \qquad (11)$$

The resulting graphs are treated as Markov chains by
normalizing the weights of the outbound edges of each node
to 1 and by defining an equivalence relation between nodes
and states, as well as between edge weights and transition
probabilities. Their equilibrium distribution is adopted as
the activation and saliency maps. In the equilibrium
distribution, nodes that are highly dissimilar to surrounding
nodes will be assigned large values. The activation maps
are finally normalized to emphasize conspicuous detail, and
then combined into a single overall map.
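
The activation step for a single feature map can be sketched as below: edge weights are the log-ratio dissimilarity times a Gaussian of the lattice offset, rows are normalized into transition probabilities, and the equilibrium distribution is approximated by iteration. This is an illustrative sketch, not the GBVS code of [121]. It assumes a strictly positive, non-constant feature map; `sigma`, `n_iter`, and the "lazy" update (averaging each step with the current state to avoid oscillation on periodic chains) are choices made here.

```python
import numpy as np

def gbvs_activation(feature_map, sigma=2.0, n_iter=200):
    """Equilibrium distribution of the Markov chain built on one feature map."""
    m = np.asarray(feature_map, dtype=float)   # must be strictly positive
    rows, cols = m.shape
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    pos = np.stack([ii.ravel(), jj.ravel()], axis=1).astype(float)
    vals = m.ravel()

    # Dissimilarity |log(M(i,j) / M(p,q))| between every pair of nodes.
    dissim = np.abs(np.log(vals[:, None] / vals[None, :]))
    # Gaussian fall-off with squared lattice distance.
    dist_sq = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=2)
    weights = dissim * np.exp(-dist_sq / (2.0 * sigma ** 2))

    # Outbound weights of each node are normalized to sum to 1.
    transition = weights / weights.sum(axis=1, keepdims=True)

    state = np.full(vals.size, 1.0 / vals.size)
    for _ in range(n_iter):
        # Lazy step: same equilibrium as the plain chain, but never oscillates.
        state = 0.5 * state + 0.5 * (state @ transition)
    return state.reshape(rows, cols)

# Toy map: uniform background with one deviating location.
toy = np.ones((5, 5))
toy[2, 2] = 5.0
activation = gbvs_activation(toy)
print(np.unravel_index(activation.argmax(), activation.shape))
```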

Avraham et al. [153] introduced the E-saliency (Extended
saliency) model by utilizing a graphical model approximation
to extend their static saliency model based on self-similarities.
The algorithm is essentially a method for estimating
the  probability  that  a  candidate  is  a  target.  The  E-Saliency 
algorithm  is  as  follows:  1)  Candidates  are  selected  using 
some  segmentation  process,  2)  The  preference  for  a  small 
number  of  expected  targets  (and  possibly  other  preferences) 
is  used  to  set  the  initial  (prior)  probability  for  each  candidate 
to  be  a  target,  3)  The  visual  similarity  is  measured  between 
every  two  candidates  to  infer  the  correlations  between  the 
corresponding  labels,  4)  Label  dependencies  are  represented 
using  a  Bayesian  network,  5)  The  N  most  likely  joint  label 
assignments  are  found,  and  6)  Saliency  of  each  candidate  is 
deduced  by  marginalization. 

Pang  et  al.  [102]  presented  a  stochastic  model  of  visual 
attention  based  on  the  signal  detection  theory  account  of 
visual  search  and  attention  [155].  Human  visual  attention 
is  not  deterministic  and  people  may  attend  to  different 
locations  on  the  same  visual  input  at  the  same  time.  They 
proposed  a  dynamic  Bayesian  network  to  predict  where 
humans  typically  focus  in  a  video  scene.  Their  model 
consists  of  four  layers.  In  the  first  layer,  a  saliency  map 
(Itti's)  is  derived  that  shows  the  average  saliency  response 
in each location in a video frame. Then, in the second layer, a
stochastic saliency map converts the saliency map into natural
human responses through a Gaussian state space model.
In the third layer, an eye movement pattern controls the
degree of overt shifts of attention through a Hidden Markov
Model, and finally, in the fourth layer, an eye focusing density
map predicts positions that people likely pay attention to based
on the stochastic saliency map and eye movement patterns. They
reported  a  significant  improvement  in  eye  fixation  detection 
over  previous  efforts  at  the  cost  of  decreased  speed. 

Chikkerur  et  al.  [154]  proposed  a  model  similar  to  the 
model  of  Rao  et  al.  [217]  based  on  assumptions  that  the 
goal  of  the  visual  system  is  to  know  what  is  where  and 
that  visual  processing  happens  sequentially.  In  this  model, 
attention  emerges  as  the  inference  in  a  Bayesian  graphical 
model  which  implements  interactions  between  ventral  and 
dorsal areas. This model is able to explain some physiological
data (neural responses in the ventral stream (V4 and PIT)
and dorsal stream (LIP and FEF)) as well as psychophysical
data (human fixations in free viewing and search tasks).

Graphical models could be seen as a generalized version
of Bayesian models. This allows them to model more
complex attention mechanisms over space and time, which
results in good prediction power (e.g., [121]). The drawbacks
lie  in  model  complexity,  especially  when  it  comes  to  training 
and  readability. 

3.6  Spectral  Analysis  Models  (S) 

Instead of processing an image in the spatial domain, models
in this category derive saliency in the frequency domain.


Hou and Zhang [150] developed the spectral residual
saliency model based on the idea that similarities imply
redundancies. They propose that statistical singularities in
the spectrum may be responsible for anomalous regions in
the image, where proto-objects become conspicuous. Given
an input image $I(x)$, amplitude $A(f)$ and phase $P(f)$ are
derived. Then, the log spectrum $L(f)$ is computed from
the down-sampled image. From $L(f)$, the spectral residual
$R(f)$ can be obtained by convolving $L(f)$ with $h_n(f)$, which
is an n x n local average filter, and subtracting the result from
$L(f)$ itself. Using the inverse Fourier transform, they construct
the saliency map in the spatial domain. The value of each
point in the saliency map is then squared to indicate the
estimation error. Finally, they smooth the saliency map with
a Gaussian filter $g(x)$ for better visual effect. The entire
process is summarized below:

$$A(f) = |\mathcal{F}[I(x)]|, \quad P(f) = \varphi(\mathcal{F}[I(x)]),$$
$$L(f) = \log(A(f)), \quad R(f) = L(f) - h_n(f) * L(f), \qquad (12)$$
$$S(x) = g(x) * \left|\mathcal{F}^{-1}\left[\exp(R(f) + iP(f))\right]\right|^2$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier and inverse Fourier
transforms, respectively, $\varphi$ extracts the phase angle, and $*$
denotes convolution. $P$ denotes the phase spectrum
of the image, and is preserved during the process. Using
a threshold, they find salient regions called proto-objects
for fixation prediction. As a testament to its conceptual
clarity, residual saliency could be computed in 5 lines of
Matlab code [187]. But note that these lines exploit complex
functions that have long implementations (e.g., $\mathcal{F}$ and $\mathcal{F}^{-1}$).
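
For illustration, the same pipeline can be sketched in Python with NumPy. This is not the Matlab code of [187]: the helper names, the filter size `n=3`, the blur width `sigma=2.0`, and the small constant added before the logarithm are assumptions of this sketch, and the initial down-sampling step is left to the caller.

```python
import numpy as np

def box_filter(a, n):
    """n x n local average with edge replication (n odd)."""
    pad = n // 2
    padded = np.pad(a, pad, mode="edge")
    out = np.zeros(a.shape, dtype=float)
    for dy in range(n):
        for dx in range(n):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (n * n)

def gaussian_blur(a, sigma):
    """Separable Gaussian blur; assumes the image is larger than the kernel."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    a = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 0, a)
    return np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, a)

def spectral_residual_saliency(image, n=3, sigma=2.0):
    spectrum = np.fft.fft2(np.asarray(image, dtype=float))
    amplitude = np.abs(spectrum)                       # A(f)
    phase = np.angle(spectrum)                         # P(f)
    log_amplitude = np.log(amplitude + 1e-12)          # L(f), guarded against log(0)
    residual = log_amplitude - box_filter(log_amplitude, n)   # R(f)
    # Back to the spatial domain with the original phase, then squared.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_blur(saliency, sigma)              # smoothing with g(x)

# Toy input: a small bright square on a dark background.
image = np.zeros((32, 32))
image[12:16, 20:24] = 1.0
saliency_map = spectral_residual_saliency(image)
print(saliency_map.shape)
```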

Guo et al. [156] showed that incorporating the phase
spectrum of the Fourier transform instead of the amplitude
spectrum leads to better saliency predictions. Later, Guo
et al. [157] proposed a quaternion representation of an
image combining intensity, color, and motion features. They
called this method "phase spectrum of quaternion Fourier
transform (PQFT)" for computing spatio-temporal saliency
and applied it to videos. Taking advantage of the
multi-resolution representation of the wavelet, they also proposed
a foveation approach to improve coding efficiency in video
compression.

Achanta et al. [158] implemented a frequency-tuned approach
to salient region detection using low-level features of
color and luminance. First, the input RGB image I is transformed
to CIE Lab color space. Then, the scalar saliency
map S for image I is computed as $S(x, y) = \|I_\mu - I_{\omega_{hc}}(x, y)\|$,
where $I_\mu$ is the arithmetic mean image feature vector, $I_{\omega_{hc}}$
is a Gaussian blurred version of the original image using
a 5 x 5 separable binomial kernel, $\|\cdot\|$ is the L2 norm
(Euclidean distance), and x, y are the pixel coordinates.
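
The frequency-tuned computation can be sketched as follows. The sketch assumes the image has already been converted to Lab (the color conversion is omitted), uses edge replication at the borders, and the function name `frequency_tuned_saliency` is hypothetical.

```python
import numpy as np

def frequency_tuned_saliency(lab_image):
    """S(x, y) = || mean Lab vector - blurred Lab vector at (x, y) ||.

    lab_image: (height, width, 3) array assumed to be in Lab space already.
    """
    lab = np.asarray(lab_image, dtype=float)
    h, w, _ = lab.shape
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # 1D binomial kernel
    padded = np.pad(lab, ((2, 2), (2, 2), (0, 0)), mode="edge")
    # Separable 5 x 5 blur: vertical pass, then horizontal pass.
    vertical = sum(kernel[k] * padded[k:k + h, :, :] for k in range(5))
    blurred = sum(kernel[k] * vertical[:, k:k + w, :] for k in range(5))
    mean_vector = lab.reshape(-1, 3).mean(axis=0)          # arithmetic mean feature vector
    return np.linalg.norm(blurred - mean_vector, axis=2)   # per-pixel L2 norm

# Toy Lab image: dark background with a bright 3 x 3 patch in the middle.
lab = np.zeros((9, 9, 3))
lab[3:6, 3:6, 0] = 100.0
saliency_map = frequency_tuned_saliency(lab)
print(np.unravel_index(saliency_map.argmax(), saliency_map.shape))
```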

Bian and Zhang [159] proposed the Spectral Whitening
(SW) model based on the idea that the visual system bypasses
the redundant (frequently occurring, non-informative) features
while responding to rare (informative) features. They
used spectral whitening as a normalization procedure in the
construction of a map that only represents salient features
and localized motion while effectively suppressing redundant
(non-informative) background information and ego-motion.
First, a grayscale input image $I(x, y)$ is low-pass filtered
and subsampled. Next, a windowed Fourier transform
of the image is calculated as $f(u, v) = \mathcal{F}[w(I(x, y))]$, where
$\mathcal{F}$ denotes the Fourier transform and w is a windowing
function. The normalized (flattened or whitened) spectral
response $n(u, v) = f(u, v)/\|f(u, v)\|$ is transformed into
the spatial domain through the inverse Fourier transform
($\mathcal{F}^{-1}$) and squared to emphasize salient regions. Finally, it is
convolved with a Gaussian low-pass filter $g(u, v)$ to model
the spatial pooling operation of complex cells:
$S(x, y) = g(u, v) * \|\mathcal{F}^{-1}[n(u, v)]\|^2$.
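
A minimal sketch of the whitening steps is given below. Several choices are assumptions of this sketch rather than details from [159]: a Hann window plays the role of w, the Gaussian pooling is applied as a product in the frequency domain (hence a circular convolution), `sigma` and `eps` are arbitrary, and the initial low-pass filtering and subsampling are omitted.

```python
import numpy as np

def spectral_whitening_saliency(image, sigma=1.5, eps=1e-12):
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    window = np.outer(np.hanning(h), np.hanning(w))   # windowing function w
    f = np.fft.fft2(img * window)                      # f(u, v)
    n = f / (np.abs(f) + eps)                          # whitened spectrum n(u, v)
    energy = np.abs(np.fft.ifft2(n)) ** 2              # back to space, then squared

    # Gaussian low-pass filter expressed through its frequency response.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    lowpass = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(energy) * lowpass))

# Toy input: a single bright pixel, which whitening maps back onto itself.
image = np.zeros((16, 16))
image[8, 8] = 1.0
saliency_map = spectral_whitening_saliency(image)
print(np.unravel_index(saliency_map.argmax(), saliency_map.shape))
```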

Spectral analysis models are simple to explain and implement.
While these models are very successful, their biological
plausibility is not very clear.

3.7  Pattern  Classification  Models  (P) 

Machine learning approaches have also been used in modeling
visual attention by learning models from recorded
eye-fixations or labeled salient regions. Typically, attention
control works as a "stimuli-saliency" function to select,
re-weight, and integrate the input visual stimuli. Note that
these  models  may  not  be  purely  bottom-up  since  they  use 
features  that  guide  top-down  attention  (e.g.,  faces  or  text). 

Kienzle et al. [165] introduced a non-parametric bottom-up
approach for learning attention directly from human eye tracking
data. The model consists of a nonlinear mapping from an image
patch to a real value, trained to yield positive outputs on
fixations and negative outputs on randomly selected image patches.
The saliency function is chosen to maximize prediction performance
on the observed data. A support vector machine (SVM) was trained
to determine the saliency using the local intensities. For videos,
they proposed to learn a set of temporal filters from eye
fixations to find the interesting locations.

The advantage of this approach is that it does not need a priori
assumptions about the features that contribute to salience or
about how these features are combined into a single salience map.
This method also produces center-surround operators analogous to
the receptive fields of neurons in early visual areas (LGN and V1).

Peters and Itti [101] trained a simple regression classifier to
capture the task-dependent association between a given scene
(summarized by its gist) and preferred locations to gaze at while
human subjects were playing video games. During testing of the
model, the gist of a new scene is computed for each video frame
and is used to compute the top-down map. They showed that a
point-wise multiplication of bottom-up saliency with the top-down
map learned in this way results in higher prediction performance.
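
As a rough illustration of this combination step, the sketch below fits a linear least-squares map from gist vectors to gaze-density maps and multiplies the predicted top-down map point-wise with a bottom-up map. The function names, the plain least-squares fit, the clipping of negative predictions, and the normalization to unit sum are assumptions made for the example; the original model's regression details are not given in the text above.

```python
import numpy as np


def fit_top_down(gists, gaze_maps):
    """Least-squares weights mapping gist vectors (n, d) to maps (n, H, W)."""
    gists = np.asarray(gists, dtype=float)
    targets = np.asarray(gaze_maps, dtype=float).reshape(len(gists), -1)
    weights, *_ = np.linalg.lstsq(gists, targets, rcond=None)
    return weights  # shape (d, H * W)


def predict_top_down(gist, weights, shape):
    """Predict a non-negative top-down map for one gist vector."""
    pred = (np.asarray(gist, dtype=float) @ weights).reshape(shape)
    return np.clip(pred, 0.0, None)


def combine_bu_td(bu_map, td_map, eps=1e-12):
    """Point-wise product of bottom-up and top-down maps, scaled to sum to 1."""
    bu = np.asarray(bu_map, dtype=float)
    td = np.asarray(td_map, dtype=float)
    if bu.shape != td.shape:
        raise ValueError("maps must have the same shape")
    combined = bu * td
    total = combined.sum()
    return combined / total if total > eps else combined
```

The product keeps only locations that are supported by both sources, which is one simple reading of why the combination can outperform either map alone.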

Judd et al. [166], similar to Kienzle et al. [165], trained a
linear SVM from human fixation data using a set of low-, mid-, and
high-level image features to define salient locations. Feature
vectors from fixated locations and random locations were assigned
+1 and -1 class labels, respectively. Their results over a dataset
of 1003 images observed by 15 subjects (gathered by the same
authors) show that combining all aforementioned features plus
distance from the image center produces the best eye fixation
prediction performance.
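
To make the training setup concrete, here is a minimal sketch of a linear SVM fitted by full-batch subgradient descent on the regularized hinge loss, with feature vectors labeled +1 (fixated) and -1 (random). The optimizer, the hyperparameters `lam`, `lr`, and `epochs`, and the function names are assumptions for illustration; the text does not say which solver or settings the authors used, and the feature extraction itself is not shown.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(max(0, 1 - y * (X @ w + b)))."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)          # labels in {+1, -1}
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1.0                # samples inside the margin
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def saliency_scores(features, w, b):
    """Signed distance to the decision boundary, used as a saliency value."""
    return np.asarray(features, dtype=float) @ w + b
```

Applying `saliency_scores` to the feature vector of every pixel yields a continuous map; thresholding is not needed when the map is evaluated against fixations.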

As the amount of available eye movement data increases, and as
eye tracking devices that support large-scale data gathering
become more widespread, these models are becoming popular. This,
however, makes models data-dependent (which complicates fair
model comparison), slow, and, to some extent, black-box.


3.8  Other  Models  (O) 

Some other attention models that do not fit into our
categorization are discussed below.

Ramstrom and Christiansen [168] introduced a saliency measure
using multiple cues based on game theory concepts, inspired by the
selective tuning approach of Tsotsos et al. [15]. Feature maps are
integrated using a scale pyramid where the nodes are subject to
trading on a market, and the outcome of the trading represents the
saliency. They use the spotlight mechanism for finding regions of
interest.

Rao et al. [23] proposed a template matching type of model that
slides a template of the desired target to every location in the
image and, at each location, computes salience as some similarity
measure between the template and the local image patch.
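
A minimal sketch of such a sliding-template salience map follows. The text only says "some similarity measure", so the choice of zero-mean normalized cross-correlation here is an assumption, as are the function name and the decision to evaluate only positions where the template fits fully inside the image.

```python
import numpy as np


def template_saliency(image, template, eps=1e-12):
    """Normalized cross-correlation of `template` at every valid offset."""
    img = np.asarray(image, dtype=float)
    tpl = np.asarray(template, dtype=float)
    th, tw = tpl.shape
    rows, cols = img.shape[0] - th + 1, img.shape[1] - tw + 1
    t = tpl - tpl.mean()
    t_norm = np.linalg.norm(t)
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = img[r:r + th, c:c + tw]
            p = patch - patch.mean()
            # Cosine similarity between the mean-removed patch and template.
            out[r, c] = (p * t).sum() / (np.linalg.norm(p) * t_norm + eps)
    return out
```

Entry `(r, c)` of the result refers to the patch whose top-left corner is at `(r, c)`; values lie in [-1, 1], with values near 1 indicating a close match to the target.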

Ma et al. [33] proposed a user attention model for video content
that incorporates top-down factors into the classical bottom-up
framework by extracting semantic cues (e.g., face, speech, and
camera motion). First, the video sequence is decomposed into
primary elements of basic channels. Next, a set of attention
modeling methods generates attention maps separately. Finally,
fusion schemes are employed to obtain a comprehensive attention
map, which may be used as an importance ranking or as an index of
video content. They applied this model to video summarization.

Rosin [169] proposed an edge-based scheme (EDS) for saliency
detection over grayscale images. First, a Sobel edge detector is
applied to the input image. Second, the gray-level edge image is
thresholded at multiple levels to produce a set of binary edge
images. Third, a distance transform is applied to each of the
binary edge images to propagate the edge information. Finally, the
gray-level distance transforms are summed to obtain the overall
saliency map. This approach has not been successful over color
images.
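
The four steps can be sketched as follows for small images. This is an illustrative reading of the description, not Rosin's code: the number and spacing of thresholds, the brute-force Euclidean distance transform (quadratic in the number of pixels), the fallback distance for an empty edge image, and all function names are assumptions. In this sketch, small summed distances mark pixels close to strong edges, so a caller may want to invert the map for display.

```python
import numpy as np


def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel kernels (edge-replicated borders)."""
    p = np.pad(np.asarray(img, dtype=float), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)


def distance_to_edges(mask):
    """Brute-force Euclidean distance from each pixel to the nearest edge."""
    h, w = mask.shape
    edges = np.argwhere(mask)
    if edges.size == 0:
        return np.full((h, w), float(np.hypot(h, w)))  # assumed fallback
    rr, cc = np.mgrid[0:h, 0:w]
    pix = np.stack([rr.ravel(), cc.ravel()], axis=1)
    d2 = ((pix[:, None, :] - edges[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1)).reshape(h, w)


def edge_distance_map(img, levels=4):
    """Sum of distance transforms of the edge image at several thresholds."""
    mag = sobel_magnitude(img)
    thresholds = np.linspace(mag.min(), mag.max(), levels + 2)[1:-1]
    return sum(distance_to_edges(mag >= t) for t in thresholds)
```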

Garcia-Diaz et al. [160] introduced the Adaptive Whitening
Saliency (AWS) model by adopting the variability in local energy
as a measure of saliency. The input image is transformed to the
Lab color space. The luminance (L) channel is decomposed into a
multi-oriented, multi-resolution representation by means of a
Gabor-like bank of filters. The opponent color components a and b
undergo a multi-scale decomposition. By decorrelating the
multi-scale responses, extracting from them a local measure of
variability, and further performing a local averaging, they
obtained a unified and efficient measure of saliency.
Decorrelation is achieved by applying PCA over a set of
multi-scale low-level features. Distinctiveness is measured using
Hotelling's $T^2$ statistic.
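
The decorrelation and distinctiveness steps can be illustrated on a generic feature matrix. The sketch below is a strong simplification of AWS and rests on stated assumptions: it applies one global PCA whitening to a matrix with one row per pixel and one column per feature, and it omits the Gabor-like decomposition, the per-scale processing, and the local averaging. The function name is hypothetical.

```python
import numpy as np


def hotelling_t2(features, eps=1e-10):
    """Per-sample Hotelling T^2 after PCA whitening of an (n, d) matrix."""
    X = np.asarray(features, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)            # sample covariance
    evals, evecs = np.linalg.eigh(cov)            # PCA of the features
    keep = evals > eps                            # drop degenerate directions
    Z = (Xc @ evecs[:, keep]) / np.sqrt(evals[keep])  # whitened coordinates
    return (Z ** 2).sum(axis=1)                   # squared norm = T^2
```

After whitening, every retained component has unit variance, so the squared norm of a sample measures how far it deviates from the typical response; rows with large values are the distinctive ones.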

Goferman et al. [46] proposed a context-aware saliency detection
model. Salient image regions are detected based on four principles
of human attention: 1) local low-level considerations such as
color and contrast, 2) global considerations, which suppress
frequently occurring features while maintaining features that
deviate from the norm, 3) visual organization rules, which state
that visual forms may possess one or several centers of gravity
about which the form is organized, and 4) high-level factors, such
as human faces. They applied their saliency method to two
applications: retargeting and summarization.

Aside  from  the  models  discussed  so  far,  there  are  several 
other  attention  models  that  are  relevant  to  the  topic  of 
this  review,  though  they  do  not  explicitly  generate  saliency 
maps.  Here  we  mention  them  briefly. 
To overcome the problem of designing the state-space for a
complex task, an approach proposed by Sprague and Ballard [109]
decomposes a complex, temporally extended task into simple
behaviors (also called micro-behaviors), one of which is to attend
to obstacles or other objects in the world. This behavior-based
approach learns each micro-behavior and uses arbitration to
compose these behaviors and solve complex tasks. This complete
agent architecture is of interest as it studies the role of
attention while it interacts and shares limited resources with
other behaviors.

Based on the idea that vision serves action, Jodogne et al. [162]
introduced an approach for learning action-based image
classification known as Reinforcement Learning of Visual Classes
(RLVC). RLVC consists of two interleaved learning processes: an RL
unit, which learns image-to-action mappings, and an image
classifier, which incrementally learns to distinguish visual
classes. RLVC is a feature-based approach in which the entire
image is processed to find out whether a specific visual feature
exists, in order to move through a binary decision tree. Inspired
by RLVC and U-TREE [163], Borji et al. [88] proposed a
three-layered approach for interactive object-based attention.
Each time the object that is most important to disambiguate
appears, a partially unknown state is attended by the biased
bottom-up saliency model and recognized. Then the appropriate
action for the scene is performed. Some other models in this
category are Triesch et al. [97], Mirian et al. [100], and Paletta
et al. [164].

Walker et al. [21] built a model based on the idea that humans
fixate at those informative points in an image that reduce our
overall uncertainty about the visual stimulus, similar to another
approach by Lee and Yu [149]. This model is a sequential
information maximization approach whereby each fixation is aimed
at the most informative image location given the knowledge
acquired at each point. A foveated representation is incorporated,
with resolution decreasing as distance from the center increases.
Shape histogram edges are used as features.

Lee and Yu [149] proposed that mutual information among the
cortical representations of the retinal image, the priors
constructed from our long-term visual experience, and a dynamic
short-term internal representation constructed from recent
saccades all provide a map for guiding eye navigation. By
directing the eyes to locations of maximum complexity in neuronal
ensemble responses at each step, the automatic saccadic eye
movement system greedily collects information about the external
world while modifying the neural representations in the process.
This model is close to Najemnik & Geisler's work [20].

Fig. 6. A hierarchical illustration of described models. Solid
rectangles show salient region detection methods.

To recap, here we offer a unification of several saliency models
from a statistical viewpoint. The first class measures bottom-up
saliency as $1/P(x)$, $-\log P(x)$, or $E_x[-\log P(x)]$, which is
the entropy. This includes Torralba and Oliva [92] [93], SUN
[141], AIM [144], Hou and Zhang [151], and probably Yin Li [171].
Some other methods are equivalent to this but with specific
assumptions for $P(x)$. For example, Rosenholtz [191] assumes a
Gaussian, and Seo and Milanfar [108] assume that $P(x)$ is a
kernel density estimate (with the kernel that appears inside the
summation in the denominator of (7)). Next, there is a class of
top-down models with the same saliency measure. For example,
Elazary and Itti [90] use $\log P(x|Y=1)$ (where $Y=1$ means
target presence) and assume a Gaussian for $P(x|Y=1)$. SUN can
also be seen like this, if the first term of (5) is called a
bottom-up component. But, as discussed next, it is probably better
to consider it an approximation to the methods in the third class.
The third class includes models that compute posterior
probabilities $P(Y=1|X)$ or likelihood ratios
$\log[P(x|Y=1)/P(x|Y=0)]$. This is the case for discriminant
saliency [146] [147] [215], but it also appears in Harel et al.
[121] (e.g., equation 10) and in Liu et al. [43] (if the
interaction potentials of a CRF are set to zero, one ends up with
a computation of the posterior $P(Y=1|X)$ at each location). All
these methods model the saliency of each location independently of
the others. The final class, graphical models, introduces
connections between spatial neighbors. These could be clique
potentials in CRFs, edge weights in Harel et al. [121], etc.
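
Two of the relations used in this recap can be written out explicitly. The block below is a short worked note rather than part of the original text: the symbols $\mu$, $\Sigma$, and $\sigma$ are notation introduced here. The first line shows what the self-information measure becomes under a Gaussian assumption for $P(x)$; the second shows, via Bayes' rule, that the posterior of the third class is a monotonic function of the log-likelihood ratio plus a constant prior term, which is why the two quantities rank locations identically when the prior is fixed.

```latex
\begin{align}
-\log P(x) &= \tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)
             + \tfrac{1}{2}\log\det(2\pi\Sigma)
   && \text{if } P(x)=\mathcal{N}(x;\mu,\Sigma), \\
P(Y=1 \mid x) &= \sigma\!\left(
      \log\frac{P(x \mid Y=1)}{P(x \mid Y=0)}
    + \log\frac{P(Y=1)}{P(Y=0)} \right),
   && \sigma(z)=\frac{1}{1+e^{-z}} .
\end{align}
```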

Fig. 6 shows a hierarchical illustration of the models. A summary
of attention models and their categorization according to the
factors mentioned in Section 2 is presented in Fig. 7.

4  Discussion 

There  are  a  number  of  outstanding  issues  with  attention 
models  that  we  discuss  next. 

A big challenge is the degree to which a model agrees with
biological findings. Why is such an agreement important? How can
we judge whether a model is indeed biologically plausible? While
there is no clear answer to these questions in the literature,
here we give some hints at their answer. In the context of
attention, biologically inspired models have resulted in higher
accuracies in some cases. In support of this statement, the
Decision theoretic [147] [223] and (later) AWS [160] models (and
perhaps some other models) are good examples because they explain
No | Model                         | Year | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8  | f9  | f10          | f11 | f12           | f13
Bottom-up (saliency models)
 1 | Itti et al. [14]              | 1998 | +  | -  | -  | +  | -  | -  | +  | f   | +   | CIO          | C   |               |
 2 | Privitera & Stark [127]       | 2000 | +  | -  | -  | +  | -  | -  | +  | f   | +   | -            | O   |               | Stark and Choi
 3 | Salah et al. [52]             | 2002 | +  | +  | -  | +  | -  | -  | +  | -   | +   | O            | G   | DR            | Digit & Face
 4 | Itti et al. [119]             | 2003 | +  | -  | +  | +  | +  | +  | +  | f   | +   | CIOFM        | C   |               |
 5 | Torralba [92]                 | 2003 | -  | +  | -  | +  | -  | -  | +  | s   | +   | CI           | B   | DR            | Torralba et al.
 6 | Sun & Fisher [117]            | 2003 | +  | -  | -  | +  | -  | -  | +  | -   | -   | CIO          | G   |               |
 7 | Gao & Vasconcelos [146]       | 2004 | -  | +  | -  | +  | -  | -  | +  | s   | -   | DCT          | O   | DR            | Brodatz, Caltech
 8 | Ouerhani et al. [210]         | 2004 | +  | -  | -  | +  | -  | -  | +  | f   | +   | CIO+Corner   | C   | CC            | Ouerhani
 9 | Boccignone & Ferraro [175]    | 2004 | +  | -  | +  | -  | +  | -  | +  | f   | -   | Optical Flow | B   |               | BEHAVE
10 | Frintrop [50]                 | 2005 | +  | +  | +  | +  | +  | +  | +  | f/s | +/- | CIOM         | C   |               |
11 | Itti & Baldi [145]            | 2005 | +  | -  | +  | +  | +  | +  | -  | f   | +   | CIOFM        | B   | KL, AUC       | ORIG-MTV
12 | Ma et al. [33]                | 2005 | +  | -  | +  | +  | -  | -  | +  | f   | +   | M*           | O   |               |
13 | Bruce & Tsotsos [144]         | 2006 | +  | -  | -  | +  | -  | +  | +  | f   | +   | DOG, ICA     | I   | KL, ROC       | Bruce and Tsotsos
14 | Navalpakkam & Itti [51]       | 2006 | -  | +  | -  | +  | -  | +  | +  | s   | +   | CIO          | C   |               |
15 | Zhai & Shah [103]             | 2006 | +  | -  | +  | +  | +  | -  | +  | f   | +   | SIFT         | O   |               |
16 | Harel et al. [121]            | 2006 | +  | -  | -  | +  | -  | -  | +  | f   | +   | IO           | G   | AUC           | Bruce and Tsotsos
17 | Le Meur et al. [41]           | 2006 | +  | -  | -  | +  | -  | -  | +  | f   | +   | LM*          | C   | CC, KL        | Le Meur et al.
18 | Walther & Koch [35]           | 2006 | +  | -  | -  | +  | -  | +  | +  | f   | +/- | CIO          | C   |               |
19 | Peters & Itti [101]           | 2007 | +  | +  | +  | +  | +  | -  | +  | i   | +   | CIOFM        | P   | KL, NSS       | Peters and Itti
20 | Liu et al. [43]               | 2007 | +  | -  | -  | +  | -  | -  | +  | f   | -   | Liu*         | G   | F-measure     | Regional
21 | Shic & Scassellati [74]       | 2007 | +  | -  | +  | +  | +  | -  | +  | f   | +   | CIOM         | C   | ROC           | Shic and Scassellati
22 | Hou & Zhang [150]             | 2007 | +  | -  | -  | +  | -  | +  | +  | f   | +   | FFT, DCT     | S   | NSS           | DB of Hou and Zhang, 2007
23 | Cerf et al. [167]             | 2007 | +  | +  | -  | +  | -  | +  | +  | f/s | +   | CIO :)       | C   | AUC           | Cerf et al.
24 | Le Meur et al. [138]          | 2007 | +  | -  | +  | +  | +  | -  | +  | f   | +   | LM*          | C   | CC, KL        | Le Meur et al.
25 | Mancas [152]                  | 2007 | +  | -  | +  | +  | +  | +  | +  | f   | +   | CI           | I   | CC            | Le Meur et al.
26 | Guo et al. [156]              | 2008 | +  | -  | -  | +  | -  | -  | +  | f   | +   | CIO          | O   | CC            | Self data
27 | Zhang et al. [141]            | 2008 | +  | -  | -  | +  | -  | +  | +  | f   | +   | DOG, ICA     | B   | KL, AUC       | Bruce and Tsotsos
28 | Hou & Zhang [151]             | 2008 | +  | -  | +  | +  | +  | -  | +  | f   | +   | ICA          | I   | AUC, KL       | Bruce and Tsotsos, ORIG
29 | Pang et al. [102]             | 2008 | +  | +  | +  | +  | +  | -  | +  | f   | +   | CIOM         | G   | NSS           | ORIG, Self data
30 | Kootstra et al. [136]         | 2008 | +  | -  | -  | +  | -  | -  | +  | f   | +   | Symmetry     | C   | CC            | Kootstra et al.
31 | Ban et al. [172]              | 2008 | +  | -  | +  | +  | +  | -  | +  | f   | +   | CIO+SYM      | I   |               |
32 | Rajashekar et al. [174]       | 2008 | +  | -  | -  | +  | -  | -  | +  | f   | +   | R*           | S   | CC            | Rajashekar et al.
33 | Kienzle et al. [165]          | 2009 | +  | -  | -  | +  | -  | -  | +  | f   | +   | I            | P   | K*            | Kienzle et al.
34 | Marat et al. [49]             | 2009 | +  | -  | +  | +  | +  | -  | +  | f   | +   | SM*          | C   | NSS           | Marat et al.
35 | Judd et al. [166]             | 2009 | +  | -  | -  | +  | -  | -  | +  | f   | +   | J*           | P   | AUC           | Judd et al.
36 | Seo & Milanfar [108]          | 2009 | +  | -  | +  | +  | +  | +  | +  | f   | +   | LSK          | I   | AUC, KL       | Bruce and Tsotsos, ORIG
37 | Rosin [169]                   | 2009 | +  | -  | -  | +  | -  | -  | +  | f   | +   | C + Edge     | O   | PR, F-measure | DB of Liu et al., 2007
38 | Yin Li et al. [171]           | 2009 | -  | +  | +  | +  | +  | +  | +  | s   | +   | RGB          | S   | DR            | DB of Hou and Zhang, 2007
39 | Bian & Zhang [159]            | 2009 | +  | -  | +  | +  | +  | +  | +  | f   | +   | FFT          | S   | AUC           | Bruce and Tsotsos
40 | Diaz et al. [160]             | 2009 | +  | -  | -  | +  | -  | +  | +  | f   | +   | CIO          | O   | AUC           | Bruce and Tsotsos
41 | Zhang et al. [142]            | 2009 | +  | -  | +  | -  | +  | -  | +  | f   | +   | DOG, ICA     | B   | KL, AUC       | Bruce and Tsotsos
42 | Achanta et al. [158]          | 2009 | +  | -  | -  | +  | -  | -  | +  | f   | +   | DOG          | S   | PR            | DB of Liu et al., 2007
43 | Gao et al. [147]              | 2009 | +  | -  | +  | +  | +  | +  | +  | f   | +   | CIO          | O   | AUC           | Bruce and Tsotsos
44 | Chikkerur et al. [154]        | 2010 | +  | +  | -  | +  | -  | +  | +  | f/s | +/- | CIO          | B   | AUC           | Bruce and Tsotsos, Chikkerur
45 | Mahadevan & Vasconcelos [106] | 2010 | +  | -  | +  | -  | +  | -  | +  | -   | +   | I            | O   | DR, AUC       | SVCL background data
46 | Avraham & Lindenbaum [153]    | 2010 | +  | +  | -  | +  | -  | +  | +  | f/s | +/- | CIO          | G   | DR, CC        | UWGT, Ouerhani et al.
47 | Jia Li et al. [133]           | 2010 | -  | +  | +  | +  | +  | -  | +  | f   | +   | CIO          | B   | AUC           | RSD, MTV, ORIG, Peters and Itti
48 | Guo et al. [157]              | 2010 | +  | -  | +  | +  | +  | +  | +  | f/s | +/- | FFT          | S   | DR            | Self data
49 | Borji et al. [89]             | 2010 | -  | +  | -  | +  | -  | +  | +  | s   | +/- | CIO          | O   | DR            |
50 | Goferman et al. [46]          | 2010 | +  | -  | -  | +  | -  | -  | +  | -   | +   | C :)         | O   | AUC           | DB of Hou and Zhang, 2007
51 | Murray et al. [200]           | 2011 | +  | -  | -  | +  | -  | -  | +  | f   | +   | CIO          | C   | AUC, KL       | Bruce and Tsotsos, Judd et al.
52 | Wang et al. [201]             | 2011 | +  | -  | -  | +  | -  | -  | +  | f   | +   |              |     |               |
Fig. 7. Summary of visual attention models. Factors in order are: Bottom-up (f1), Top-down (f2), Spatial (-)/Spatio-temporal (+) (f3),
Static (f4), Dynamic (f5), Synthetic (f6) and Natural (f7) stimuli, Task type (f8), Space-based (+)/Object-based (-) (f9), Features
(f10), Model type (f11), Measures (f12), and Used dataset (f13). In the Task type (f8) column: free-viewing (f); target search (s);
interactive (i). In the Features (f10) column: M* = motion saliency, static saliency, camera motion, object (face) and aural saliency
(speech-music); LM* = contrast sensitivity, perceptual decomposition, visual masking and center-surround interactions; Liu* =
center-surround histogram, multi-scale contrast and color spatial-distribution; R* = luminance, contrast, luminance-bandpass,
contrast-bandpass; SM* = orientation and motion; J* = CIO, horizontal line, face, people detector, gist, etc.; S* = color matching,
depth and lines; :) = face. In the Model type (f11) column, R means that a model is based on RL. In the Measures (f12) column: K* = used
the Wilcoxon-Mann-Whitney test (the probability that a randomly chosen target patch receives higher saliency than a randomly chosen
negative one); DR means that models have used a measure of detection/classification rate to determine how successful a model was. PR
stands for Precision-Recall. In the dataset (f13) column: Self data means that the authors gathered their own data.
Fig. 8. Sample images from image and video datasets along with eye fixations and predicted attention maps. As can be seen,
human and animal bodies and faces, symmetry, and text attract human attention. The fourth row shows that these datasets are highly
center-biased, mainly because there are some interesting objects at the image center (MEP map). The lower center-bias in the mean
saliency maps of the models indicates that a Gaussian, on average, works better than many models.


some basic behavioral data (e.g., nonlinearity against
orientation contrast, efficient (parallel) and inefficient
(serial) search, orientation and presence-absence asymmetries, and
Weber's law [75]) well, and these data have been less explored by
other models. These models are among the best in predicting
fixations over images and videos [160]. Hence, biological
plausibility could be rewarding. We believe that creating a
standard set of experiments for judging the biological
plausibility of models would be a promising direction to take. For
some models, prediction of fixations is more important than
agreement with biology (e.g., pattern classification vs. cognitive
models). These models usually feed features to some classifier,
but what type of features or classifiers fall under the realm of
biologically inspired techniques? The answer lies in the
behavioral validity of each individual feature as well as the
classifier (e.g., faces or text, SVM vs. neural networks). Note
that these problems are not specific to attention modeling and are
applicable to other fields in computer vision (e.g., object
detection and recognition).

Regarding fair model comparison, results often disagree when
using different evaluation metrics. Therefore, a unified
comparison framework is required, one that standardizes measures
and datasets. We should also discuss the treatment of image
borders and its influence on results. For example, KL and NSS
measures are corrupted by an edge effect due to variations in
handling invalid filter responses at the image borders. Zhang et
al. [141] studied the impact of varying amounts of edge effects on
the ROC score over a dummy saliency map (consisting of all ones)
and showed that as the border increases, AUC and KL measures
increase as well. The dummy saliency map gave an ROC value of 0.5,
a four-pixel black border gave 0.62, and an eight-pixel black
border gave 0.73. The same three border sizes would yield KL
scores of 0, 0.12, and 0.25. Another challenge is handling the
center-bias that results from a high density of eye fixations at
the image center. Because of this, a trivial Gaussian blob model
scores higher than almost all saliency models (see [166]). This
can be partially verified from the average eye fixation maps of
three popular datasets shown in Fig. 8. Comparing the mean
saliency maps of models with the fixation distributions, it can be
seen that the Judd et al. [166] model has higher center-bias due
to explicitly using the center feature, which also leads to higher
eye movement prediction for this model. To eliminate the border
and center-bias effects, Zhang et al. [141] defined a shuffled AUC
metric instead of the uniform AUC metric: for an image, the
positive sample set is composed of the fixations of all subjects
on that image, and the negative set is composed of the union of
all fixations across all images, except for the positive samples.
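
The construction of the positive and negative sets can be sketched as follows. The function names and the (row, column) fixation format are assumptions, and the AUC is computed here through the Mann-Whitney statistic (the probability that a positive sample scores higher than a negative one, counting ties as one half), which equals the area under the ROC curve.

```python
import numpy as np


def auc_from_scores(pos, neg):
    """P(pos > neg) + 0.5 * P(pos == neg), i.e., the area under the ROC."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


def shuffled_auc(saliency, fixations, other_fixations):
    """AUC with negatives drawn from fixations recorded on other images.

    `fixations` are (row, col) points on this image; `other_fixations`
    pools the fixations from all images. Points that coincide with a
    positive sample are removed from the negative set.
    """
    positives = [tuple(p) for p in fixations]
    pos_set = set(positives)
    negatives = [tuple(p) for p in other_fixations if tuple(p) not in pos_set]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    pos_vals = [saliency[r, c] for r, c in positives]
    neg_vals = [saliency[r, c] for r, c in negatives]
    return auc_from_scores(pos_vals, neg_vals)
```

Because the negatives share the spatial bias of real fixations, a map that only encodes a central Gaussian scores near chance under this metric, and a constant map scores exactly 0.5.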

As shown by Figs. 4 and 5, many different eye movement datasets
are available, each one recorded under different experimental
conditions with different stimuli and tasks. Yet more datasets are
needed because the available ones suffer from several drawbacks.
Consider that current datasets do not tell us about covert
attention mechanisms at all and can only tell us about overt
attention (eye tracking). One approximation is to compare overt
attention shifts to verbal or other reports, whereby reported
objects that were not fixated might have been covertly attended
to. There is also a lack of multi-modal datasets in interactive
environments. In this regard, a promising new effort is to create
tagged object datasets similar to video LabelMe [188]. Bruce and
Tsotsos [144] and ORIG [184] are, respectively, the most widely
used image and video datasets, though they are highly
center-biased (see Fig. 8). Thus there is a need for standard
benchmark datasets as well as rigorous performance measures for
attention modeling. Similar efforts have already been started in
other research communities, such as object recognition (PASCAL
challenge), text information retrieval (TREC datasets), and face
recognition (e.g., FERET).

The majority of models are bottom-up, though it is known that
top-down factors play a major role in directing attention [177].
However, the field of attention modeling lacks principled ways to
model top-down attention components as well as the interaction of
bottom-up and top-down factors. Feed-forward bottom-up models are
general, easy to apply, do not need training, and yield reasonable
performance, making them good heuristics. On the other hand,
top-down definitions usually use feedback and employ learning
mechanisms to adapt themselves to specific tasks/environments and
stimuli, making them more powerful but more complex to deploy and
test (e.g., they need to be trained on large datasets).

Some models need many parameters to be tuned while others need
fewer (e.g., spectral saliency models). Methods such as Gao et al.
[147], Itti et al. [14], Oliva et al. [140], and Zhang et al.
[142] are based on Gabor or DOG filters and require many design
parameters, such as the number and type of filters, choice of
non-linearities, and normalization schemes. Properly tuning the
parameters is important for the performance of these types of
models.

Fig.  9  presents  sample  saliency  maps  of  some  models 
discussed  in  this  paper. 

5  Summary  and  Conclusion 

In this paper, we discussed recent advances in modeling visual
attention with an emphasis on bottom-up saliency models. A large
body of past research was reviewed and organized in a unified
context by qualitatively comparing models over 15 experimental
criteria. Advancement in this field could greatly help solve other
challenging vision problems such as cluttered scene interpretation
and object recognition. In addition, there are many technological
applications that can benefit from it. Several factors influencing
bottom-up visual attention have been discovered by behavioral
researchers and have further inspired the modeling community.
However, there are several other factors remaining to be
discovered and investigated. Incorporating those additional
factors may help to bridge the gap between the human
inter-observer model (a map built from fixations of other subjects
over the same stimulus) and the prediction accuracy of
computational models. With the recent rapid progress, there is
hope this may be achievable in the near future.

Most of the previous modeling research has been focused on the
bottom-up component of visual attention. While previous efforts
are appreciated, the field of visual attention still lacks
computational principles for task-driven attention. A promising
direction for future research is the development of models that
take into account time-varying task demands, especially in
interactive, complex, and dynamic environments. In addition, there
is not yet a principled computational understanding of covert and
overt visual attention, which should be clarified in the future.
The solutions are beyond the scope of computer vision and require
collaboration from the machine learning community.
Fig. 9. Sample saliency maps of models over the Bruce and Tsotsos
(left), Kootstra et al. (middle), and Judd et al. (right) datasets.
Black rectangles mean that the dataset was first used by that model.
Humans employ interacting bottom-up and top-down processes to significantly speed up search and
recognition of particular targets. We describe a new model of attention guidance for efficient and scalable
first-stage search and recognition with many objects (117,174 images of 1147 objects were tested, and
40 satellite images). Performance for recognition is on par with or better than SIFT and HMAX, while being,
respectively, 1500 and 279 times faster. The model is also used for top-down guided search, finding a
desired object in a 5 x 5 search array within four attempts, and improving performance for finding
houses in satellite images.
1. Introduction

Attempting  to  search  for  and  recognize  particular  known  ob¬ 
jects  in  a  scene  can  be  extremely  complex  when  one  has  to  con¬ 
sider  all  possible  views  an  object  can  take.  Fiumans  employ 
attention  to  try  to  limit  the  amount  of  information  that  needs  to 
be  processed  in  order  to  speed  up  search  and  recognition  (we 
rarely  look  at  the  sky  when  searching  for  our  car).  Previous  re¬ 
search  has  shown  that  visual  search  tasks  can  be  performed  faster 
when  one  knows  the  exact  target  in  visual  space,  as  opposed  to 
only  a  semantic  description  of  the  target  (Wolfe,  1998).  Therefore, 
humans  use  cues  from  the  target  image  to  help  facilitate  search. 
One  can  also  consider  implementing  attention  in  the  feature  do¬ 
main  when  searching  through  a  large  dataset  for  a  particular  ob¬ 
ject.  For  example,  if  we  wish  to  search  for  a  green  bottle,  we 
could  bias  the  visual  system  so  that  green  vertical  edges  would 
be  perceived  faster  than  other  features  (since  bottles  are  often  up¬ 
right).  This  would  allow  us  to  focus  a  more  complex  recognition 
onto  only  the  locations  in  the  search  scene  that  contain  green  ver¬ 
tical  edges,  which  would  speed  up  the  search  significantly.  Like¬ 
wise,  during  recognition,  that  green  vertical  edge  may  be  useful 
to  quickly  narrow  down  onto  a  smaller  set  of  possible  recognition 
candidates.  The  use  of  various  features  in  this  manner  can  help  sift 
through  very  large  object  datasets  when  attempting  to  recognize 
objects  (consider  the  large  number  of  objects  that  an  adult  human 
can  identify).  Lastly,  it  has  been  shown  by  Tsotsos  (1991)  that 
knowing  the  features  of  a  target  reduces  the  complexity  of  visual 
search from NP-complete to linear. These findings suggest that humans
employ various heuristics to improve the tractability of performing
search and recognition. In this paper, we develop a model
which  explores  the  use  of  biologically  plausible  attentional  heuris¬ 
tics  to  speed  up  search  and  recognition. 

It  is  well  known  that  the  search  and  recognition  behavior  in  hu¬ 
mans  can  be  explained  through  the  combination  of  bottom-up 
information  from  the  incoming  visual  scene  (Itti  &  Koch,  2001; 
Theeuwes,  1995),  and  of  top-down  information  from  the  visual 
knowledge  of  the  target  and  the  scene  (Moran  &  Desimone, 
1985;  Motter,  1994;  Treue  &  Trujillo,  1999;  Wolfe  et  al.,  2004; 
Krummenacher, Muller, Reimann, & Heller, 2001; Theeuwes,
1994; Hayhoe & Ballard, 2005). However, the exact interaction between
the two processes still remains elusive, which has made it
difficult  to  develop  machine  vision  systems  exploiting  both  bot¬ 
tom-up  and  top-down  information. 

There  have  been  at  least  three  major  theories  on  mechanisms  of 
integration  between  bottom-up  and  top-down  vision  occurring  in 
the  visual  cortex.  The  first  is  Feature  Integration  Theory  (Treisman 
&  Gelade,  1980;  Treisman  &  Sato,  1990),  in  which  several  low-level 
visual  features  are  processed  over  the  entire  visual  field  in  separate 
neuronal  maps  (called  feature  maps),  and  then  combined  to  form  a 
master  map  that  guides  attention.  If  the  target  can  be  defined  by  a 
set  of  primitive  feature  maps  (e.g.,  it  has  a  distinct  color,  orientation, 
intensity), then these maps can be biased using such top-down information
to elicit the target location. However, if the target is defined
only  by  some  conjunctions  of  these  primitive  feature  maps  (e.g.,  a 
unique  combination  of  color  and  orientation),  then  a  serial  search 
is  required  to  find  the  target,  since  a  unique  signature  of  the  target 
cannot  be  obtained  from  the  separate  feature  maps  alone.  In 
contrast, the Guided Search method proposed by Wolfe (1994) creates
a master activation map where top-down knowledge is used
to  weigh  the  relative  contributions  of  bottom-up  feature  maps  to 
emphasize  both  features  (e.g.,  a  red  color)  and  locations  (e.g.,  the 
top-left  corner  of  the  image)  likely  to  characterize  the  target.  The 
model  then  uses  the  combination  of  these  weighted  maps  to  shift 
attention  towards  the  most  promising  locations.  Lastly,  the  Biased 
Competition  Model  proposed  by  Desimone  and  Duncan  (1995)  in¬ 
volves  competition  between  visual  stimuli  at  each  stage  of  process¬ 
ing,  which  is  influenced  by  top-down  modulation.  In  this  model, 
attention  biases  the  response  of  a  local  feature  detector  when  two 
stimuli  are  simultaneously  exciting  it  (i.e.,  are  presented  within 
the  receptive  field  of  the  same  visual  neuron).  The  response  is  biased 
in  the  direction  of  the  attended  feature  in  a  different  location.  In  all 
these  models,  choosing  the  correct  feature  maps  to  use  in  visual 
search,  as  well  as  deciding  how  exactly  to  influence  these  maps  with 
top-down  information,  is  crucial  to  search  performance. 

Previous  models  such  as  Feature  Integration  Theory  (Treisman 
&  Gelade,  1980;  Treisman  &  Sato,  1990),  Guided  Search  (Wolfe, 
1994),  Biased  Competition  Model  (Desimone  &  Duncan,  1995) 
and  Optimal  Gains  (Navalpakkam  &  Itti,  2006a,  2007)  have  largely 
concentrated  on  biasing  the  feature  maps  in  a  global  way  to  facil¬ 
itate  efficient  search  (Fig.  1).  For  example,  changing  gains  (or 
weights)  over  whole  maps  has  been  proposed  and  implemented 
in  Wolfe  (1994),  Navalpakkam  and  Itti  (2006a)  and  Treue  and  Truj¬ 
illo  (1999).  However,  simply  setting  feature  gains  globally  may  not 
always  accelerate  search  for  a  target  object,  especially  for  maps 
that  code  for  features  shared  by  the  target  and  many  distractors. 
Furthermore,  previous  models  have  concentrated  on  determining 
the  values  of  these  gains  from  the  objects  so  as  to  guide  search  to¬ 
wards  them,  but  most  have  not  shown  how  they  can  be  used  for 
object recognition. In this work a common representational framework
is used for learning how to bias towards desired targets and
for recognizing these targets when they are found. This allows
the same top-down signals or parameters used for attention biasing
to also be used for recognition.


One  of  the  previous  proposals  to  compute  the  gain  or  weight  of 
particular feature maps is to base the values on the signal-to-noise
ratio,  defined  as  the  ratio  of  a  detector’s  response  to  the  target  rel¬ 
ative  to  a  distractor.  Namely,  this  approach  proposed  that  the  rel¬ 
ative  weights  of  feature  maps  should  be  modulated  top-down  in 
proportion  to  each  map’s  ability  to  distinguish  the  target  from 
the  distractors  (Navalpakkam  &  Itti,  2006a;  Navalpakkam  &  Itti, 
2007).  One  shortcoming  of  such  an  approach  is  that,  if  the  detec¬ 
tors  in  a  given  feature  map  respond  to  both  the  target  and  the  dis¬ 
tractors  equally,  then  no  change  in  gain  will  take  place  (Fig.  2a), 
which  would  not  contribute  to  improvement  of  search  speed. 
Moreover,  if  a  feature  detector  responds  more  strongly  to  a  distrac¬ 
tor  object  than  to  the  target,  a  reduction  in  gain  of  this  map  would 
occur,  which  could  end  up  turning  off  this  map  completely.  As  a  re¬ 
sult,  only  the  feature  maps  that  can  uniquely  distinguish  the  object 
being  searched  for  are  amplified.  Nonetheless,  if  a  target  object 
contains  a  weak  red  feature  among  strong  red  distractors,  the  weak 
red  signal  could  in  principle  be  used  to  find  the  object  by  guiding 
attention  towards  locations  where  feature  detectors  report  low 
red  values.  Even  if  the  feature  maps  are  divided  into  sub-bands 
with  finer  granularity  (Fig.  2a  and  b)  as  proposed  in  Navalpakkam 
and  Itti  (2006b),  one  can  always  design  search  arrays  in  which  one 
band  can  code  for  both  the  target  and  distractors,  leading  to  a 
failed  discrimination. 
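
The sub-band ambiguity can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it takes the sub-band means (50, 100, 150), the standard deviation (10) and the target value (125) from the caption of Fig. 2, and assumes, for illustration only, two distractors with feature values 100 and 150 and unnormalized Gaussian tuning curves. The function names are hypothetical.

```python
import math

def gaussian_response(x, mean, sd):
    """Unnormalized Gaussian tuning curve; equals 1.0 when x == mean."""
    return math.exp(-((x - mean) ** 2) / (2.0 * sd ** 2))

SD = 10.0                                        # from the Fig. 2 caption
SUB_BANDS = {"A": 50.0, "B": 100.0, "C": 150.0}  # from the Fig. 2 caption
TARGET = 125.0                                   # from the Fig. 2 caption
DISTRACTORS = [100.0, 150.0]                     # assumed for illustration

def band_response(band, value):
    """Response of one fixed sub-band to a feature value."""
    return gaussian_response(value, SUB_BANDS[band], SD)

def best_distractor_response(band):
    """Strongest response of a sub-band to any of the assumed distractors."""
    return max(band_response(band, d) for d in DISTRACTORS)

def target_model_response(value):
    """Likelihood-style detector whose tuning curve is centered on the target."""
    return gaussian_response(value, TARGET, SD)

if __name__ == "__main__":
    for band in SUB_BANDS:
        print(band, band_response(band, TARGET), best_distractor_response(band))
    print("model", target_model_response(TARGET),
          max(target_model_response(d) for d in DISTRACTORS))
```

Under these assumptions every sub-band responds more strongly to some distractor than to the target (sub-bands B and C give about 0.04 for the target against 1.0 for a distractor), so no choice of positive per-band gains favors the target, whereas the detector tuned to the target value responds most strongly to the target itself.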

There  have  also  been  many  contributions  to  object  recognition 
and  search  in  the  computer  vision  literature.  These  contributions 
often  concentrate  on  two  aspects  of  the  problem:  developing 
methods  to  extract  features  from  images,  and  creating  algorithms 
to  classify  these  features.  Some  of  the  research  has  also  indepen¬ 
dently  been  focused  on  searching  for  objects  once  particular  fea¬ 
tures  have  been  learned.  For  example,  simple  template  matching 
(Gonzalez  &  Wintz,  1987;  Horn,  1986;  Pratt,  1991;  MacLean  & 
Tsotsos,  2008)  or  back-projection  approaches  (Bradski,  1998; 
Comaniciu  &  Meer,  1997)  use  some  knowledge  (a  template  or 
histogram)  to  check  every  possible  location  in  the  image  for  a  good 
match.  These  techniques  often  fail  when  the  object’s  pose  or 


Fig.  1.  Example  search  in  previous  models  such  as  Feature  Integration  Theory  (Treisman  &  Gelade,  1980;  Treisman  &  Sato,  1990),  Guided  Search  (Wolfe,  1994),  Biased 
Competition  Model  (Desimone  &  Duncan,  1995)  and  Optimal  Gains  (Navalpakkam  &  Itti,  2006a,  2007).  Left  image  shows  the  basic  scheme  of  computing  a  saliency  map  from 
the  weighted  sum  of  various  feature  maps  with  varying  scales  (intensity,  color,  orientation,  etc.).  Biasing  the  saliency  map  towards  a  particular  feature  or  scale  can  be 
achieved  by  changing  the  relative  weights  (w)  between  the  feature  maps.  Right  image  shows  how  greater  granularity  in  biasing  can  be  accomplished  by  splitting  a  particular 
feature  map  into  multiple  sub-bands.  Ultimately,  both  models  fail  to  provide  fine  granularity  in  biasing  for  specific  features  (see  text  for  explanation). 
Fig. 2. An example of biasing using feature bands (a, b) and a likelihood model (c, d). In both cases the target (at spatial position 50 in a 1D slice of a feature map) has a feature
value of 125. (a) Shows how three sub-bands with mean feature responses at 50, 100, 150 and standard deviation of 10 will split the feature space. (b) Shows the ambiguity in
the response of sub-bands B and C when searching for the target, whereby each sub-band responds more vigorously to a distractor than to the target (sub-band A does not
respond at all to the target and is not shown). As a result, changing the weight of any one of them will not yield a higher response for the target. (c) Shows how knowing the
model of the target can give the granularity needed to find the target, while (d) shows the response from the learned model.


illumination is changed. To speed up search, an attentional framework
proposed by Bonaiuto and Itti (2005) uses a bottom-up saliency
map to rapidly eliminate locations in scenes which are
unlikely to contain interesting objects. Although they reported faster
results in their searches, the system lacked a method for
exploiting top-down knowledge about the search target’s features.
Obdrzalek and Matas (2005) have also proposed a method which
helps speed up the classification stage by organizing the classifier
into a binary tree to achieve a log(N) time complexity. Tagare, Toyama,
and Wang (2001) proposed a model in which an attentional
strategy was used to reduce overall computations by performing
fast but approximate image measurements. However, their computations
involved finding parts of objects and determining their
relationships in an approximate manner. In contrast, the contribution
of the present paper is to provide a good feature set which can
be  quickly  classified  with  a  simple  classifier,  as  well  as  the  ability 
to  use  these  feature  sets  to  create  a  biased  saliency  map  in  order 
to  quickly  find  the  object  in  the  scene  regardless  of  pose.  The  meth¬ 
ods  described  above  can  then  be  used  to  perform  a  more  thorough 
evaluation  of  objects  deemed  by  our  system  to  be  highly  probable 
candidates,  after  these  candidates  have  been  selected  in  a  first 
quick  pass  by  our  algorithm. 

In  this  paper  we  draw  inspiration  from  both  the  computer  vi¬ 
sion  literature  and  models  of  the  visual  cortex  and  present  a  meth¬ 
od  based  on  a  Bayesian  framework  to  account  for  search  and 
recognition  in  a  probabilistic  manner.  In  particular,  a  new  model 
of combined attention and recognition is developed with dual
emphasis. First, top-down biasing towards desired features should
be  readily  available  and,  if  possible,  stronger  than  modulating  the 
relative  gains  of  different  visual  features  guiding  search,  as  ex¬ 
plored  in  the  past  (Wolfe,  1994;  Navalpakkam  &  Itti,  2007).  Second, 
a  common  representational  framework  should  be  developed  that 
can  be  used  both  for  biasing  towards  desired  targets  as  well  as 
for speeding up recognition when these targets are found. We
name our algorithm SalBayes, which denotes our system’s marriage
of saliency and Bayesian modeling.

From  a  biological  aspect,  this  paper  aims  to  develop  a  new  ap¬ 
proach  which  considers  profiles  of  detectors  that  are  more  likely 
to  respond  to  the  target  by  shaping  their  tuning  curve  towards 
the  target  individually.  In  particular,  we  consider  a  Bayesian  frame¬ 
work  that  uses  the  prior  knowledge  of  the  objects  to  help  shape  the 
response  of  the  detector  profile  in  a  dynamic  manner.  This  ap¬ 
proach  achieves  greater  granularity  in  the  discrimination  ability 
of  the  search  without  the  added  overhead  and  limitations  of  multi¬ 
ple  sub-bands.  Additionally,  the  same  information  learned  during 
recognition  is  used  to  guide  attention.  This  is  achieved  by  learning 
the  likelihood  probability  density  functions  (PDFs)  of  salient  fea¬ 
tures  of  various  objects  and  then  using  these  likelihoods  to  com¬ 
pute  a  probable  location  of  objects  during  a  search  task. 

The result of this work is a single computationally efficient system
which provides dual use. When given a location in an image,
the system will output a sorted list of the objects that can be found
at the given location, together with their associated probabilities.
Alternatively, when given a description of an object,
the system will produce a sorted list of locations and the associated
probabilities that the given object can be found at each location.
From these results, other more comprehensive models (which
would presumably be slower) can operate on these lists to yield robust
object recognition and search. Hence, we address the problem
of  prioritizing  search  and  recognition,  narrowing  down  from  long 
and  unordered  to  shorter  and  ordered  lists  of  candidates,  rather 
than  completely  solving  and  outputting  a  single  recognized  object 
label  at  a  single  location.  We  show  how  this  is  achieved  by  learning 
the  visual  features  of  an  object,  which  is  used  for  recognition  as 
well as for efficient top-down-guided search. In testing against
large standard databases (Amsterdam Library of Object Images
(ALOI) (Geusebroek, Burghouts, & Smeulders, 2005), Columbia Object
Image Library (COIL) (Nene, Nayar, & Murase, 1996), and SOIL-47
(Burianek, Ahmadyfard, & Kittler, 2001)), we find that this approach
delivers robust machine vision performance, comparable to
and much faster than other more sophisticated, computationally
intensive, state-of-the-art machine vision systems (HMAX,
SIFT) for recognition, while additionally providing a common
framework for search and recognition.

In  the  following  section  we  describe  the  model  and  its  compo¬ 
nents.  We  start  with  the  simple  problem  of  object  classification, 
and  of  defining  a  representation  that  can  be  learned  from  example 
views  of  objects.  We  then  explore  how  this  representation  can  also 
be  used  to  provide  efficient  visual  search  for  the  learned  objects. 
Section  3  describes  the  testing  methodologies,  datasets  and  results, 
while  Section  4  provides  discussion  of  the  model  and  results. 

2.  Methods 

The  model  proposed  in  this  paper  draws  its  inspiration  from 
Bayesian  theory  as  well  as  from  the  bottom-up  attention  model  pro¬ 
posed  by  Itti  et  al.  (Itti,  Koch,  &  Niebur,  1998;  Itti  &  Koch,  2000).  By 
learning  the  statistical  variations  in  features  of  various  objects,  the 
model  is  able  to  perform  an  efficient  visual  search  for  a  given  target 
object,  as  well  as  classify  target  and  distractor  objects.  At  its  core,  the 
model  learns  the  probability  of  an  object’s  visual  appearance  having 
a  range  of  values  within  a  particular  feature  map.  In  a  visual  search 
task,  the  model  influences  the  various  feature  maps  by  computing 
the  probability  of  a  given  target  object  for  each  detector  within  a  fea¬ 
ture  map.  As  a  result,  locations  in  the  maps  with  the  highest  proba¬ 
bility  will  be  searched  first,  as  they  indicate  likely  positions  for  the 
target  object.  Both  the  prior  and  likelihood  probabilities  can  be 
learned  from  training  views  of  the  object  and  the  context.  As  we  will 
see,  a  chief  advantage  of  this  approach  is  in  its  simplicity  and  speed, 
which  make  it  an  ideal  candidate  for  a  front-end  system  that  quickly 
narrows  search  down  to  a  few  likely  candidates  which  can  then  be 
investigated  in  more  detail  by  more  sophisticated  and  time-consum¬ 
ing  recognition  algorithms. 

2.1.  Object  representation 

To  uniquely  describe  the  appearance  of  an  object,  a  number  of 
feature  maps  are  computed  from  the  biologically  inspired  bot¬ 
tom-up  saliency  model  proposed  by  Itti  et  al.  (Itti  et  al.,  1998;  Itti 
&  Koch,  2000).  The  saliency  map  represents  statistically  unique 
locations  in  an  image  after  being  decomposed  into  different  feature 
maps  at  several  spatial  scales.  That  is,  the  saliency  map  attempts  to 
detect  anomalies,  or  outliers  in  the  image  within  various  feature 
spaces.  In  this  paper  the  feature  map  domains  consist  of  intensity, 
color  opponency  (red-green,  blue-yellow)  and  four  orientations 
(0°,  45°,  90°,  135°).  These  particular  feature  maps  were  selected 
based  on  the  implementation  proposed  by  Itti  et  al.  (1998)  which 
derived  its  inspiration  from  a  review  of  which  elementary  visual 
features contribute to visual saliency in natural scenes (Wolfe
et al., 2004). In the absence of top-down modulations, a normalization
operator, N(.), within each feature map weighs the values of
detectors  in  a  data-driven  fashion  based  on  their  uniqueness  in 
that  map.  That  is,  the  more  different  the  response  of  a  given  detec¬ 
tor  is  from  its  neighbors  and  globally,  the  higher  the  weight  as¬ 
signed  to  that  detector’s  output.  This  normalization  operator  can 
also  be  thought  of  as  providing  spatial  competition  between  neigh¬ 
boring  pixels.  The  normalizing  operator  is  computed  as  follows  (Itti 
et  al.,  1998): 

1. Normalize the values in the map to a fixed range [0, ..., M], in order to eliminate modality-dependent amplitude differences;

2. Find the location of the map's global maximum M and compute the average $\bar{m}$ of all its other local maxima; and

3. Globally multiply the map by $(M - \bar{m})^2$.
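
The three steps of N(.) can be sketched as follows. This is a minimal illustration under stated assumptions, not the distributed toolkit code: local maxima are taken to be interior pixels strictly greater than their four neighbors, the map is multiplied by the squared difference between the global maximum M and the mean of the other local maxima as in Itti et al. (1998), and the function names and the default M = 1 are hypothetical.

```python
import numpy as np

def local_maxima(m):
    """Boolean mask of interior pixels strictly greater than their 4 neighbors."""
    center = m[1:-1, 1:-1]
    mask = np.zeros(m.shape, dtype=bool)
    mask[1:-1, 1:-1] = (
        (center > m[:-2, 1:-1]) & (center > m[2:, 1:-1]) &
        (center > m[1:-1, :-2]) & (center > m[1:-1, 2:])
    )
    return mask

def normalize_map(feature_map, M=1.0):
    """Sketch of the spatial-competition operator N(.)."""
    m = np.asarray(feature_map, dtype=float)
    lo, hi = m.min(), m.max()
    if hi == lo:
        # A flat map has no unique location; return all zeros.
        return np.zeros_like(m)
    # Step 1: rescale to the fixed range [0, M].
    m = (m - lo) / (hi - lo) * M
    # Step 2: average of the local maxima other than the global maximum.
    peaks = m[local_maxima(m)]
    others = peaks[peaks < M]
    m_bar = others.mean() if others.size else 0.0
    # Step 3: globally multiply by (M - m_bar)^2.
    return m * (M - m_bar) ** 2
```

A map with a single strong peak keeps its amplitude, while a map with several peaks of comparable height is suppressed, which is the data-driven weighting by uniqueness described above.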

The  42  feature  maps  (seven  features  at  six  spatial  scales)  are 
then  combined  into  a  saliency  map,  which  indicates  the  saliency 
of  each  location  in  the  image.  Implementation  details  of  this  model 
have  been  described  previously  (Itti  et  al.,  1998;  Itti  &  Koch,  2000) 
and the algorithm is freely distributed in source code at
http://iLab.usc.edu/toolkit/.

To  characterize  the  target,  the  most  salient  features  from  each 
of  the  42  feature  maps  are  sampled  within  a  given  fovea  size  (or 
patch  size)  centered  on  the  object.  Note  that  this  location  does 
not  need  to  be  the  center  of  the  object,  nor  does  the  object  need 
to  be  segmented.  The  only  requirement  is  that  the  object  should 
overlap  with  the  fovea  location.  The  spatial  competition  will  help 
provide  a  consistent  location  from  which  to  sample  when  the  ob¬ 
ject  undergoes  various  transformations  (illumination  direction, 
rotation,  etc.).  Selecting  the  most  salient  location  to  learn  from  also 
helps  in  searching  for  the  object.  For  example,  if  we  know  that  we 
are  looking  for  a  red  dot  on  the  object,  then  it’s  worth  searching  for 
a  red  dot. 

The  motivation  behind  sampling  from  the  most  salient  location 
within  each  submap  around  the  object  is  to  select  features  that 
would  uniquely  describe  the  object,  but  would  still  remain  invari¬ 
ant  to  transformations  in  illumination,  rotation,  translation,  etc. 
This  can  then  provide  a  very  efficient  search  mechanism  when 
attempting  to  narrow  down  possible  objects  during  recognition. 
The  argument  follows  that  a  salient  location  in  an  object  would  re¬ 
main  invariant  to  transformations  since  it  is  very  unique  to  the  ob¬ 
ject.  For  example,  the  model  not  only  learns  that  the  object  has  a 
particular  strong  color  value,  but  also  that  it  has  a  particular  strong 
intensity,  and  particular  orientations.  Therefore,  not  only  the  con¬ 
junctions  of  various  feature  maps  can  yield  a  position  that  is  highly 
salient,  but  also  feature  values  within  each  feature  map  at  these 
strong  locations. 

The  method  of  only  selecting  particular  key  locations  to  de¬ 
scribe  objects  and  scenes,  rather  than  considering  the  entire  pixel 
array,  has  also  been  successfully  used  by  the  SIFT  algorithm  (Lowe, 
2004)  and  has  been  studied  by  Mikolajczyk,  Leibe,  and  Schiele 
(2005).  However,  this  paper  uses  the  saliency  map  described  above 
which  is  a  much  more  elaborate  method  of  determining  the  key- 
point  locations  in  order  to  provide  a  more  robust  feature  set  for 
recognition.  Note  that  only  the  single  most  salient  location  in  each 
feature  map  is  used  to  build  the  descriptor  vector.  This  results  in 
very  quick  recognition  rates,  since  adding  more  locations  would  re¬ 
quire  a  more  complex  model  to  account  for  spatial  locations  within 
them. In particular, the goal of the model is to code probable locations
and/or hypotheses of particular objects, but not to determine
them specifically. Therefore, we want to use as few features, with
as little complexity, as possible in order to speed up initial recognition and
search.  Other,  more  complex  models  (which  would  require  more 
time  to  compute),  would  then  be  used  on  these  locations  in  order 
to  specifically  determine  the  object. 
Nevertheless, since the features are sampled from multiple
scales,  some  spatial  information  is  encoded  in  the  feature  vector 
but  is  not  tightly  localized.  This  is  a  result  of  sampling  from  the 
various  scale-space  pyramids.  Consequently,  features  extracted 
from  a  single  salient  corner  of  a  rectangle  will  yield  a  different  sig¬ 
nature  (vector  of  features)  than  a  signature  obtained  from  a  square. 
This  is  due  to  the  fact  that  a  rectangle  will  occupy  a  different  num¬ 
ber  of  cells  within  the  image,  and  thus  will  show  up  in  a  different 
pyramid  level.  Additionally,  a  more  complex  model  (with  multiple 
features  and  locations)  can  be  considered  but  would  result  in  less 
efficient recognition or search (especially when the spatial distributions
of the features are included). For a similar approach to recognition
with a more complex model, see Shokoufandeh, Marsic, and
Dickinson (1998).

During  training,  the  object  model  descriptor  is  built  by  comput¬ 
ing  the  likelihood  probability  distributions  of  the  42  features 
resulting  from  each  feature  map.  This  PDF  is  modeled  using  a 
Gaussian  distribution  for  each  individual  feature  type,  where  both 
the mean and variance are learned. That is, the algorithm learns 42
separate univariate Gaussian distributions for each object. The
choice of this distribution is due to the simplicity and efficiency
of obtaining the parameters mean ($\mu$) and variance ($\sigma^2$) in an
on-line manner from training images. Additionally, these likelihoods
are used later for searching for the object. However, other
distributions  can  be  used  such  as  super-Gaussians,  mixtures  of 
Gaussians,  particle  filters,  or  discrete  histograms  (Scott,  1992). 
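
One way to obtain the mean and variance on-line is an incremental update that processes one training view at a time. The sketch below uses Welford's algorithm, which is our choice for illustration and not necessarily the update used in the original implementation; the class name is hypothetical, the unbiased (n - 1) variance estimate is an assumption, and the fallback standard deviation of 0.001 is the fixed value the paper reports for the single-view case.

```python
class OnlineGaussian:
    """Running mean and variance of one feature for one object (Welford update)."""

    def __init__(self, default_sd=0.001):
        self.n = 0                    # number of training views seen so far
        self.mean = 0.0               # running mean of the feature value
        self._m2 = 0.0                # running sum of squared deviations
        self.default_sd = default_sd  # used until a variance can be estimated

    def update(self, x):
        """Fold one new training value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance, or the fixed fallback with fewer than two views."""
        if self.n < 2:
            return self.default_sd ** 2
        return self._m2 / (self.n - 1)

# An object model would hold one such estimator per feature map (42 in the paper).
model = [OnlineGaussian() for _ in range(42)]
```

Because each update touches only three numbers per feature, the cost of learning a new view is constant in the number of views already seen, and no training images need to be stored.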

Given  a  region  of  interest  patch  q  with  N  pixels  from  a  particular 
location  (this  location  will  correspond  to  the  image  being  trained 
with)  from  within  a  given  feature  map  (from  the  42  feature  maps 
computed  above),  the  spatial  competition  method  N(.)  (the  non¬ 
linear  normalization  method  described  above)  is  applied  to  this 
patch  to  form  a  new  set  of  patch  values  q'.  A  feature  vector  f  is  then 
built  using  the  value  of  q  from  the  location  at  which  q'  has  a  max¬ 
imum  response.  This  value  then  forms  the  jth  component  of  the 
feature  vector  f,  and  is  denoted  Fj.  In  other  words  we  select  the 
center-surround  feature  that  has  the  highest  value  in  the  spatial 
competition  layer  (the  most  unique  feature  in  that  map). 

$$F_j = q\left[\arg\max_i (q'_i)\right] \quad \forall j \in F \qquad (1)$$

where $i$ represents the pixel position within the patch, $F_j$ is the particular
feature value from feature map $j$, and $F$ is the set of feature
maps.
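
Eq. (1) amounts to reading the raw patch value at the position where the normalized patch peaks. A minimal sketch follows, assuming each patch is given as a NumPy-compatible array and that `normalize` is any callable standing in for N(.); both function names are hypothetical.

```python
import numpy as np

def select_feature(q, q_prime):
    """Return the value of q at the location where q_prime is maximal (Eq. 1)."""
    q = np.asarray(q, dtype=float)
    q_prime = np.asarray(q_prime, dtype=float)
    i = np.argmax(q_prime)   # flat index of the winning pixel in q'
    return q.ravel()[i]      # read the un-normalized value at that pixel

def build_feature_vector(patches, normalize):
    """Apply Eq. (1) to the patch taken from each feature map."""
    return np.array([select_feature(q, normalize(q)) for q in patches])
```

With 42 feature maps, `build_feature_vector` returns a vector of 42 values, one per map, regardless of the patch size.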

The Normal distribution is then used to estimate the likelihood,
$P(F_j \mid \theta_j)$, of observing feature $F_j$ given a particular object class
parameter for this feature, $\theta_j$. For example, if $j$ is the index of the
vertical Gabor detector channel, then $F_j$ would represent the response
of that channel at its most ’unique’ location, as determined
by N(.). $\theta_j$ would then represent the learned mean and variance of
the vertical Gabor responses for object $\theta$.

The final model ($\theta$) is then a set of n parameters ($\theta_j$), each composed
of a mean ($\mu$) and a variance ($\sigma^2$) for each individual feature
map. This gives the ability to simply compute the model parameters
($\theta$), mean ($\mu$) and variance ($\sigma^2$), from the training views of
the object within each feature map, and to use a Gaussian distribution
to estimate the likelihood.

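$$P(F_j \mid \theta_j) \propto \mathcal{N}(F_j; \mu_j, \sigma_j) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\left(-\frac{(F_j - \mu_j)^2}{2\sigma_j^2}\right) \qquad (2)$$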

When learning from only a single view, the standard deviation ($\sigma$) is
initially  set  to  a  fixed  value  of  0.001,  which  was  chosen  arbitrarily 
(this  number  should  be  small  so  that  the  particular  feature  detector 
will  provide  some  discrimination).  This  gives  the  classifier  a  rough 
estimate  of  the  classification  of  the  object  with  only  one  training 
view  of  the  object,  while  fully  computing  the  variance  requires 
more  than  one  training  view. 
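
The effect of the single-view setting can be checked directly from the Gaussian density of Eq. (2). The sketch below is illustrative only; the function names and the sample feature values are hypothetical, and only the standard deviation of 0.001 comes from the text.

```python
import math

SINGLE_VIEW_SD = 0.001   # fixed value used when only one training view exists

def gaussian_pdf(f, mu, sd):
    """Gaussian density of Eq. (2) for one feature."""
    norm = sd * math.sqrt(2.0 * math.pi)
    return math.exp(-((f - mu) ** 2) / (2.0 * sd ** 2)) / norm

def gaussian_log_pdf(f, mu, sd):
    """Logarithm of the same density, computed without exponentiating."""
    return -((f - mu) ** 2) / (2.0 * sd ** 2) - math.log(sd * math.sqrt(2.0 * math.pi))

if __name__ == "__main__":
    mu = 0.5   # hypothetical feature value seen in the single training view
    for f in (0.500, 0.501, 0.505):
        print(f, gaussian_pdf(f, mu, SINGLE_VIEW_SD))
```

With such a small standard deviation the likelihood falls off steeply within a few thousandths of the learned value, so a single-view detector is highly selective; the density at the mean is also far larger than one, a reminder that these are densities rather than probabilities.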


2.2.  Object  classification 


To  classify  particular  features  obtained  from  the  feature  maps,  a 
naive  Bayesian  network  is  used.  The  choice  of  a  naive  Bayesian  net¬ 
work  in  the  model  was  made  to  reduce  the  amount  of  computations 
necessary  for  classification,  as  this  type  of  network  assumes  statisti¬ 
cal  independence  between  feature  values.  Since  some  of  the  features 
are  derived  from  different  scales  in  the  image,  our  features  are  actu¬ 
ally  guaranteed  to  be  statistically  dependent.  However,  it  has  been 
shown  that  even  if  the  features  are  statistically  dependent  upon 
one  another,  computing  the  full  network  often  only  increases  accu¬ 
racy  by  a  small  amount,  whereas  the  computations  necessary  to 
achieve  this  small  improvement  are  large  (Rish,  2001).  As  further 
evidenced in Vasconcelos and Vasconcelos (2009), for image classification,
modeling the joint distribution between pairwise features
often provides only a marginal performance boost.

Once a set of features (F) is collected from a given salient location
within the feature maps (as described above), the classification
is performed using Bayes' formula:

$$P(\theta_i \mid F) = \frac{P(F \mid \theta_i)\, P(\theta_i)}{P(F)} \qquad (3)$$


To  make  a  decision  as  to  the  type  of  classification  assigned  to  an  ob¬ 
ject,  i  can  be  iterated  over  all  known  objects  and  the  object  with  the 
greatest  posterior  can  be  chosen  as  the  best  match.  This  method  is 
known  as  Maximum  a  Posteriori  (MAP).  However,  the  goal  of  the 
model  is  to  act  as  a  fast  front-end  to  slower,  more  accurate  object 
recognition  systems,  and  so  we  instead  output  a  list  of  objects 
and  match  likelihoods  sorted  by  the  probability  that  each  object 
matches  the  requested  location.  In  our  experiments,  the  prior  is  ta¬ 
ken  to  be  1/C,  where  C  is  the  number  of  classes.  This  results  in  each 
class being equally probable to observe (uninformed prior). However,
changing the prior in response to outside knowledge could
yield better classification rates if the probability of a particular
object appearing within a given scene can be determined.

Since  the  probability  of  the  evidence  can  be  viewed  as  a  normal¬ 
izing  constant  (used  to  ensure  that  probabilities  all  add  up  to 
unity),  it  can  be  dropped  from  the  equation.  This  is  because  the 
comparison  of  the  posterior  is  between  classes,  and  only  the  great¬ 
est  one  is  selected  and  not  its  particular  value  (the  scale  of  the  va¬ 
lue  is  insignificant).  Furthermore,  the  assumption  that  features  are 
statistically  independent  from  one  another  simplifies  the  calcula¬ 
tion  to  just  multiplying  the  likelihoods  together  to  come  up  with 
a  decision,  as  opposed  to  calculating  the  full  joint  distribution  be¬ 
tween  the  features. 


$$p(\theta_i \mid F) = \arg\max_i \left( p(\theta_i) \prod_j p(F_j \mid \theta_{i,j}) \right) \qquad (4)$$

Additionally,  taking  the  product  of  many  probabilities,  some  of 
which  may  be  very  small,  can  give  rise  to  numerical  instability.  As 
a  result,  an  underflow  often  occurs  in  a  straightforward  implemen¬ 
tation  of  Eq.  (3)  when  using  more  than  a  few  features.  A  solution  to 
this problem is to take the log of the likelihood, which converts
values less than one into negative numbers whose magnitude can be much
greater than one. This also greatly simplifies the computations in
our  practical  implementation,  as  likelihood  products  are  trans¬ 
formed  into  likelihood  summations.  Also,  the  decision  to  select  a 
suitable  classification  is  not  affected,  since  only  the  maximum  of 
these  values  is  considered.  As  a  result  of  these  various  techniques, 
Eq.  (3)  can  be  described  by  the  following  formula: 

$$p(\theta_i \mid F) = \arg\max_i \left( \log p(\theta_i) + \sum_j \log p(F_j \mid \theta_{i,j}) \right) \qquad (5)$$

The enhanced version of the saliency map with the Bayesian network
used for object recognition can be seen visually in Fig. 3.
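
The log-domain classification can be sketched as a ranking over known objects. This is a minimal illustration, assuming each object model is stored as a list of (mean, variance) pairs, one per feature, and a uniform prior of 1/C unless priors are supplied; the function names and the toy models in the usage note are hypothetical.

```python
import math

def log_gaussian(f, mu, var):
    """Log of the univariate Gaussian density for one feature."""
    return -0.5 * math.log(2.0 * math.pi * var) - (f - mu) ** 2 / (2.0 * var)

def rank_objects(features, models, priors=None):
    """Return (object, log-score) pairs sorted from most to least probable.

    models maps an object name to a list of (mean, variance) pairs, one per
    feature. The evidence term P(F) is omitted because it is the same for
    every object and does not change the ordering.
    """
    n_classes = len(models)
    scores = {}
    for name, params in models.items():
        prior = priors[name] if priors else 1.0 / n_classes
        score = math.log(prior)
        for f, (mu, var) in zip(features, params):
            score += log_gaussian(f, mu, var)   # sum of logs replaces the product
        scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, with `models = {"bottle": [(0.2, 0.01), (0.8, 0.01)], "car": [(0.7, 0.01), (0.1, 0.01)]}`, calling `rank_objects([0.25, 0.75], models)` lists "bottle" first. Returning the whole sorted list, rather than only the maximum, matches the intended use as a front-end that hands a short ordered list of candidates to a slower recognizer.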
Fig. 3. The added Bayesian network to the saliency model for object recognition. Red indicates added components and data paths. The toy soldier at the input image is
selected  for  learning/classification  indicated  by  a  green  circle.  The  maximum  feature  location  in  each  center-surround  feature  map  is  used  to  train  or  classify  the  Bayesian 
network  for  the  selected  object  (indicated  by  the  probability  map  on  the  left  side).  Each  feature  map  builds  a  probability  distribution  of  the  most  salient  location  in  that  map, 
shown  in  red.  The  rest  of  the  architecture  is  as  previously  described. 


2.3.  Biasing  learned  features  for  efficient  object  search  in  a  Bayesian 
framework 

Once the parameters of a particular object are known, they can
be used to search for the object in an efficient manner. This is
accomplished by biasing the feature maps to influence the saliency
map so that the object that is being searched for becomes more
salient, which can result in faster search times simply by visiting
locations in order of salience. For example, if our bottom-up saliency
computations considered bright locations as salient, then darker
locations would often be considered last as possible targets. However,
if our object was dark in color, then biasing the saliency computations
to treat darker locations as salient would cause the biased saliency
map to highlight darker locations first, which should improve search
time.

The saliency map is biased by using the knowledge of the target
parameters, and applying them to the set of feature detectors that
are computed. Particularly, the parameters of our target are used to
look for a particular mean and a variance within a given feature
map. These parameters can be thought of as an envelope limiting
the feature map response. In other words, the feature map would
have its activation shaped by the likelihood of the particular feature
value belonging to the object. Although our system could be
thought of as only considering one sub-band, that sub-band can
be dynamically shaped (regarding its position along the feature
spectrum, and its specificity or width), thus providing an interesting
alternative to using several fixed sub-bands. The result in the
feature map then gives the probability of our object being coded
by a particular feature detector. The maximum location within
the feature map would then give an indication of the possible target
location (Fig. 2c,d). The biasing process (applying the likelihood
model to the feature map) is repeated within each feature map and
the combination of all the feature maps’ information is used to create
a saliency map where the maximum indicates the most probable
location of our target. The enhanced version of the saliency
map with the Bayesian network used for finding objects can be
seen in Fig. 4.

The various feature maps in the saliency map are biased in the
following way. First, the feature maps are computed by creating an
image pyramid of each feature type and taking the difference between
the pyramids to form center-surround responses at various scales, as
proposed in the original saliency algorithm (Itti et al., 1998; Itti &
Koch, 2000). There are 42 such maps labeled F_1 ... F_42 (four
orientations, one intensity, one blue-yellow, and one red-green, all
at six different scales). Note that spatial competition is not computed
on these feature maps and just the raw center-surround values are
used. However, it is important to remember that the spatial
competition was used when extracting the feature values during the
training phase. From the learned parameters of a particular object
\theta, the parameters (\mu_j and \sigma_j) corresponding to a
particular feature map F_j are used to calculate the probability of a
particular detector belonging to the target, p(F_j | \theta_j), in
feature map j. This is done across the n different feature maps. The
maps are then multiplied together (instead of the sum which was used
in the original model) to yield the final saliency map. Therefore, the
resulting saliency map calculates the probability of a location
containing a given target (p(F | \theta)). Again, to avoid numerical
instability and to speed up computation, the log of the probability is
used.

\log p(F \mid \theta) = \log \prod_{j=1}^{n} p(F_j \mid \theta_j) \approx \sum_{j=1}^{n} \log N(F_j; \mu_j, \sigma_j)    (6)

Fig. 4. The added Bayesian network to the saliency model for object recognition. Red indicates added components and data paths. The input image is passed for processing
(without the selected object which is indicated in this image for clarity) by the saliency computations in the normal way. After the center-surround operations, the parameters of
the object are used to find the probability of a detector indicating the position of the object in each submap (depicted as 3D graphs in the figure). All the submaps are then
multiplied together to form the saliency map (note that in the implementation the multiplications are converted to additions by the use of the log operation).


Here N is the normal (Gaussian) density function and F is the set of
all 42 features. Again, the spatial competition on the whole saliency
map is not performed during object search. This is due to the nature
of the spatial competition, which tends to suppress high values within
the same uniform region. Since that region describes the probability
of the object, that location should be allowed to contribute to the
overall saliency map.
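
As a rough illustration of Eq. (6), the sketch below sums per-map Gaussian log-likelihoods at every location to form a biased saliency map. It is only a toy version under stated assumptions: the feature maps are small hand-written 2D lists rather than center-surround pyramid outputs, all maps are assumed to share one resolution, and the (mu, sigma) values are hypothetical.

```python
import math

def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def biased_saliency(feature_maps, params):
    """Sum over maps of log N(F_j; mu_j, sigma_j) at each location.

    feature_maps: list of 2D lists with identical shape.
    params: one (mu_j, sigma_j) pair per feature map.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for fmap, (mu, sigma) in zip(feature_maps, params):
        for r in range(rows):
            for c in range(cols):
                saliency[r][c] += log_normal(fmap[r][c], mu, sigma)
    return saliency

def argmax_2d(grid):
    """Return (row, col) of the largest value in a 2D list."""
    best = max((v, r, c) for r, row in enumerate(grid) for c, v in enumerate(row))
    return best[1], best[2]

# Two hypothetical feature maps and learned target parameters.
maps = [
    [[0.1, 0.5, 0.9], [0.3, 0.7, 0.8]],
    [[0.9, 0.5, 0.2], [0.6, 0.4, 0.2]],
]
target = [(0.8, 0.1), (0.2, 0.1)]
print(argmax_2d(biased_saliency(maps, target)))  # (1, 2)
```

The location whose feature values sit exactly at the learned means receives the highest log-probability, which is the behavior the biasing step relies on.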

Once a biased saliency map is computed, the locations with the
highest locally maximal values in that map are searched first. That
is, the object model is used on the locations which are local maxima
in the biased saliency map. This process is known as attention
shifting. Finding local maxima is achieved by selecting the
maximum value in the saliency map and applying an inhibition
of return (IOR) mechanism to that location. The IOR is performed
by applying a Gaussian disk mask with fixed radius to the saliency
map, which sets all salient values underneath the mask toward zero,
so that the next maximum salient location would have to be outside
the disk. Implementation details of this mechanism have been
described previously (Itti et al., 1998; Itti & Koch, 2000).
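
The attention-shift loop described above can be sketched as follows. This is a simplified stand-in, not the implementation from Itti et al. (1998): it assumes a non-negative saliency map stored as a 2D list and replaces the Gaussian disk mask with a hard disk that zeroes everything within the radius. The function name and the example values are hypothetical.

```python
def attention_shifts(saliency, radius, max_shifts):
    """Return fixation locations in decreasing order of saliency.

    After each fixation, values within `radius` of it are set to zero
    (a hard-disk simplification of the Gaussian IOR mask).
    """
    sal = [row[:] for row in saliency]  # work on a copy
    fixations = []
    for _ in range(max_shifts):
        value, r0, c0 = max(
            ((v, r, c) for r, row in enumerate(sal) for c, v in enumerate(row)),
            key=lambda t: t[0],
        )
        if value <= 0.0:
            break  # nothing salient is left to visit
        fixations.append((r0, c0))
        for r in range(len(sal)):
            for c in range(len(sal[0])):
                if (r - r0) ** 2 + (c - c0) ** 2 <= radius ** 2:
                    sal[r][c] = 0.0
    return fixations

# Hypothetical 5 x 5 map with three peaks; (0, 1) lies inside the first disk.
grid = [[0.0] * 5 for _ in range(5)]
grid[0][0], grid[0][1], grid[4][4], grid[2][2] = 0.9, 0.8, 0.7, 0.1
print(attention_shifts(grid, radius=1, max_shifts=5))
# [(0, 0), (4, 4), (2, 2)]
```

Note that the second-highest value at (0, 1) is never visited, because it falls under the inhibition disk of the first fixation.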

3.  Results 

The  model  was  tested  on  three  publicly  available  datasets  to 
evaluate  its  performance  in  both  object  recognition  and  object 
search  tasks. 


3.1.  Object  recognition 

For the object recognition task, several challenging datasets
were used. These datasets included objects under many transformations
including rotations and various viewpoints, illumination
changes and illumination color changes. The original idea of the
experiment was to use SIFT (Lowe, 2004) on top of the output of
our model in order to speed up the search for objects during
recognition. However, during our preliminary experiments we found
that using SIFT did not actually provide better recognition results
than the raw output of our model. As a result, we directly compare
the recognition capabilities of SalBayes against state-of-the-art
object recognition methods: the SIFT algorithm as proposed by Lowe
(2004) and the HMAX algorithm with feature learning
proposed by Serre, Wolf, and Poggio (2005). These two methods
were chosen due to their popularity in the machine vision and
cognitive modeling communities. For example, HMAX has been used to
explain basic facts about the ventral visual system (Riesenhuber &
Poggio, 1999) and has been used in object recognition (Serre et al.,
2005; Serre, Wolf, Bileschi, Riesenhuber, & Poggio, 2007), while
SIFT has been used to build 3D models of objects (Snavely, Seitz,
& Szeliski, 2006), for robot navigation (Se, Lowe, & Little, 2002;
Elinas & Little, 2005; Sim, Elinas, & Griffin, 2005; Barfoot, 2005) and
for object classification (Lowe, 1999). Due to the large amount of data,
a Beowulf cluster consisting of eight nodes, each with two dual-core
Opteron(tm) processors (four cores per node, for a total of 32 cores)
running at 2.6 GHz, was used to run the algorithms in parallel. The
cumulative amount of CPU time taken for the testing sets was recorded
to compare the efficiency of the models.

The SIFT implementation was obtained from the author’s website
(Lowe, 2004), but the keypoint matching software was changed
slightly to provide keypoint matching against a large dataset.
In particular, a k-nearest neighbor algorithm (with k = 2) was used
to determine the object identity given a test image and an image
database. An implementation of HMAX with feature learning in
Matlab was obtained directly from the authors’ website
(http://cbcl.mit.edu/software-datasets/standardmodel/index.html).
However, due to the large amount of data, the software was slightly
modified to compute the features for all objects under all
transformations and save them to a file first. This allowed us to
extract the features in parallel using the Beowulf cluster. An SVM
algorithm with an RBF kernel was used for training and testing. The
implementation of the SVM was obtained from Chang and Lin (2001).

We test the proposed algorithm (along with HMAX and SIFT)
against three large standard databases (ALOI, COIL, SOIL-47)
separately and all together. The datasets are systematically broken
into training and testing sets composed of the various images in the
dataset. The splits are: one image for training and the rest for
testing; 6.25% training and 93.75% testing; 12.5% training and 87.5%
testing; 25% training and 75% testing; and 50% training and 50%
testing. The first object recognition dataset used was the Amsterdam
Library of Object Images (ALOI) (Geusebroek et al., 2005). This dataset
contains photographs of 1000 objects placed on a turntable and
subjected to various transformations. These transformations include 12
illumination colors, 24 illumination directions, and 72 viewpoints
(each object was rotated in steps of 5°). All photographs were first
scaled down to a 256 x 256 pixel image to speed up computations.
Several splits of the entire dataset into training and testing sets
were considered, from using only one instance of each transformation
(three images total) for training, to using half of the dataset for
training. Object recognition testing was then performed on all
1000 objects on transformations that were not in the training dataset.
The next dataset used was the Columbia Object Image Library
(COIL) (Nene et al., 1996), which consisted of photographs of 100
objects under 72 rotated views. The 7200 color images of
128 x 128 pixels were obtained by placing objects in the center
of a turntable that was rotated at 5° increments. Here again several
splits into training and testing sets were tested, from using only
the first image for training and all others for testing, to using half
of the dataset for training and the other half for testing. Object
recognition was then performed on all 100 objects and on views that
were not in the training datasets. The last dataset used was
SOIL-47 (Burianek et al., 2001), comprising photographs of 47
household objects. The images were obtained by placing a camera
on a robot arm and moving it to various positions. In addition, the
objects were also subjected to two illumination conditions. We
again created training sets that ranged from just a single instance
of each object, to half the dataset of the various views of the object.
In addition, one of the illumination conditions for each of the two
illumination conditions was used for training. Testing was then
performed on all objects and on views that were not in the training
datasets.
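
The dataset sizes quoted above can be cross-checked against the test-set sizes N reported in Table 1 for the 25% training split. The short sketch below is only an arithmetic check; the helper name is hypothetical, and the SOIL-47 test count is taken directly from Table 1 because the text does not give the number of views per object.

```python
def held_out_count(n_objects, images_per_object, train_fraction):
    """Number of images left for testing after removing the training share."""
    total = n_objects * images_per_object
    return total - round(total * train_fraction)

# ALOI: 12 illumination colors + 24 illumination directions + 72 viewpoints.
aloi_test = held_out_count(1000, 12 + 24 + 72, 0.25)
# COIL: 72 rotated views per object.
coil_test = held_out_count(100, 72, 0.25)
soil_test = 1410  # as reported in Table 1

print(aloi_test, coil_test, aloi_test + coil_test + soil_test)
# 81000 5400 87810
```

These values agree with the N = 81,000, N = 5400 and N = 87,810 entries in the table header.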

The results show that under many object transformations the
model was able to successfully learn the objects, classify them
correctly and search for them in an efficient manner. In particular,
Fig. 5 and Table 1 show that the model correctly classified an
average of 88.64% of the test images across the large datasets tested.
As indicated in Fig. 5, the HMAX algorithm was able to achieve a 92.46%
classification rate on the ALOI dataset. Although this is a slight
improvement over the proposed method, it should be noted that the
features computed in the HMAX algorithm have 2000 dimensions
and take more than 46 s to compute per image, whereas the proposed
model uses only 42 features and is 279 times faster (0.165 s per
image). Moreover, the increase in performance was only achieved when
training on half of the dataset, which means that the difference
between a training image and a testing image is not large. As seen,
the proposed algorithm, SalBayes, was able to achieve better
performance with less training data at speeds which greatly surpass
both HMAX and SIFT. Examining the different datasets, it can be seen
that the proposed model was able to learn the object features from
only a few training examples (fewer than five per object) and achieve
good results. In particular, the COIL dataset shows that from five
training examples, the model was able to correctly identify 91.28% of
the test images. Lastly, the test results also show that the system
performs fairly well when using only gray value images (just like
HMAX and SIFT). This indicates that the proposed system can still
provide useful information in the absence of color information.

[Figure 5: three panels, ALOI, COIL and SOIL-47.]

Fig. 5. Classification rates as a function of training size obtained by the proposed algorithm SalBayes, SIFT and HMAX.

Table 1

Average recognition from the various datasets using 25% of the data for training. N represents the number of images in the testing set. The work of others has been included in
this table to place the performance in context. To our knowledge, no one before us has used all of the 1000 objects in the ALOI database under all conditions.

Classification rate (%)

Method                                        ALOI         COIL        SOIL-47     All datasets
                                              N = 81,000   N = 5400    N = 1410    N = 87,810
SalBayes                                      83.83        97.20       84.89       88.64
SIFT                                          72.68        87.19       94.48       84.78
HMAX                                          83.42        77.02       57.87       72.77
MNS (Murthy, 2007)                            -            99.91       100.00^     -
LAF (Obdrzalek & Matas, 2002)                 -            99.90       100.00^     -
Graph matching (Kittler & Ahmadyfard, 2001)   -            -           73.0        -
Extra trees (Maree et al., 2005)              -            99.50       -           -
Sub-windows (Geurts et al., 2004)             -            99.61       -           -
SNoW/edge (Roth et al., 2002)                 -            94.13       -           -
SNoW/intensity (Roth et al., 2002)            -            92.31       -           -
Linear SVM (Roth et al., 2002)                -            91.30       -           -
NN (Roth et al., 2002)                        -            87.50       -           -

Table 2

Recognition results on the ALOI dataset under the various transformations using 25% of the data for training. N represents the number of images in the testing set. The fifth
column is the performance rate obtained when using all images (all images from A, B and C), while the sixth column represents an unweighted average of the performance obtained for
A, B and C (i.e., as if each type of transformation were equally likely to occur).

Classification rate (%)
A: changes in illumination color only. B: changes in illumination direction only. C: changes in rotation only.

Method     A            B             C             All images from A, B and C   Unweighted average of
           N = 9000     N = 18,000    N = 54,000    N = 81,000                   performance for A, B and C
SalBayes   64.79        75.50         89.71         83.83                        76.67
SIFT       89.41        71.47         70.95         72.68                        77.28
HMAX       99.04        83.13         80.76         83.42                        87.64

Because the ALOI dataset contained the most systematic
transformations, further analysis was done to determine the
classification rate under each type of transformation. Looking at
Table 2 we can see that HMAX performs best under several
transformations. In particular, it does exceptionally well on the
illumination color task. On the other hand, our new model performs
well in the illumination color task when considering gray value images.
Additionally, the model does exceptionally well under the rotation
task. This shows the model’s robustness against rotation and possibly
other 3D transformations (as can be seen in the SOIL-47 dataset)
as a result of picking the most salient features to remember when
determining the classification of an object.
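
The two summary columns of Table 2 can be recomputed from the per-transformation rates. The sketch below is only an arithmetic check using the values printed in the table; the function names are hypothetical. The unweighted means reproduce the last column exactly, whereas the image-weighted means come out close to, but not identical to, the "all images" column (for SalBayes about 83.78 versus the reported 83.83); the text does not explain this small gap.

```python
# Classification rates (%) for A (illumination color), B (illumination
# direction) and C (rotation), copied from Table 2.
TABLE2 = {
    "SalBayes": (64.79, 75.50, 89.71),
    "SIFT": (89.41, 71.47, 70.95),
    "HMAX": (99.04, 83.13, 80.76),
}
TEST_COUNTS = (9000, 18000, 54000)  # N for A, B and C

def unweighted_mean(rates):
    """Each transformation type counts equally."""
    return sum(rates) / len(rates)

def weighted_mean(rates, counts):
    """Each test image counts equally."""
    return sum(r * n for r, n in zip(rates, counts)) / sum(counts)

for method, rates in TABLE2.items():
    print(method,
          round(unweighted_mean(rates), 2),
          round(weighted_mean(rates, TEST_COUNTS), 2))
```

Because rotation accounts for two thirds of the test images, the image-weighted figure favors whichever method handles rotation best, which is why SalBayes ranks differently in the two columns.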

Looking at the timing aspects of the models tested, it can be
seen that the proposed method, SalBayes, outperforms both SIFT
and HMAX by a wide margin. Examining the results from Fig. 6, it
can be seen that for testing on half of the ALOI dataset, it took only
3.42 h for the SalBayes algorithm as opposed to 4878.3 h for SIFT
and 678.55 h for HMAX. On average across all the datasets the
new model was more than 1500 times faster than SIFT and 279
times faster than HMAX.
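
The speed ratios implied by the figures quoted in the text can be computed directly. This is a plain arithmetic sketch with a hypothetical helper name; it uses only the CPU times stated above and does not reproduce the cross-dataset averages, whose underlying per-dataset timings are not listed here.

```python
def speedup(baseline_time, model_time):
    """How many times faster the model is than the baseline."""
    return baseline_time / model_time

# Testing on half of the ALOI dataset (CPU hours from the text).
print(round(speedup(4878.3, 3.42), 1))   # SIFT vs. SalBayes: 1426.4
print(round(speedup(678.55, 3.42), 1))   # HMAX vs. SalBayes: 198.4

# Per-image feature computation (seconds from the text).
print(round(speedup(46.0, 0.165)))       # HMAX vs. SalBayes: 279
```

On this single ALOI split the ratios are somewhat below the averages quoted across all datasets, while the per-image ratio matches the 279-fold figure.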


3.2.  Grid  based  object  search 

The visual search task was evaluated by creating a dataset
which consisted of search arrays created from the ALOI images.
Figs. 7 and 8 show an example of a scene created for the search
task and the search results. The scenes were created by taking random
objects from the ALOI dataset under random transformations (from all
1000 objects) and placing them in a 2 x 2 or a 5 x 5 grid pattern. A
random object was then chosen as the target object and the system
searched for that target. This resulted in search images of size
512 x 512 pixels for the 2 x 2 grid and 1280 x 1280 pixels for the
5 x 5 grid (256 x 256 pixels per object). The parameters for the
objects that were learned from training on half of the dataset as
described above were used in this task. The number of “attention
shifts” (inspections of individual objects) taken to reach the target
object was then recorded. The inhibition of return (IOR) size was set
to a 30 pixel radius. This meant that only a small portion of the image
would be inhibited at a time. As a result, multiple fixations per
object could occur if the object had multiple strongly salient
locations, which would lead to a greater number of fixations than
there are cells in the grid.

Fig. 8 shows the number of scenes vs. the number of attention
shifts taken to reach the target object. The results show that during
the search task, only 4.2 attended locations on average (with a
standard deviation of 5.9) had to be examined in order to find
the target object. About 218 fixations (290 fixations with a 30 pixel
IOR for the whole image minus 72 fixations for one 256 x 256 image)
would be needed to systematically cover the whole image for the
2 x 2 array, and 1692 fixations for the 5 x 5 array. In particular, in
these synthetically generated scenes, the model was able to find the
target object in fewer than five attended locations in over 76% of the
scenes (average of the 5 x 5 and 2 x 2 search arrays). Since the scenes
could have contained any one of the 1000 objects, the ambiguity in
the various scenes is large. For example, a few objects are green
boxes, where the only varying feature is the size. Additionally, in
some of the images, the object was never found due to a zero
saliency value. Presumably, an exhaustive search would take place on
the locations that were not searched.

Fig. 6. Total CPU time required for testing, as a function of the fraction of each dataset that was used for training. As more images are used for training, fewer images remain
for testing (hence the decrease in processing time for HMAX), but, in the case of SIFT, a larger keypoint database is built.

Fig. 7. Example 5 x 5 search scene built from the ALOI dataset. Scenes were created by taking random objects from the ALOI dataset under random transformations (from all
1000 objects) and placing them in either a 2 x 2 or a 5 x 5 grid pattern.

[Figure 8: "SalBayes search array results"; axis: percentage of scenes.]

Fig. 8. Search results for the various scenes. The number of attention shifts
(saccades) is plotted against the percentage of all scenes. For about 60 percent of the
2 x 2 scenes the object was located within the first fixation, and for about 38
percent of the 5 x 5 scenes the object was located within the first fixation.

3.3.  Satellite  image  search 

Another search task consisted of finding houses in satellite
images. This task used satellite images (786 x 786 pixels)
obtained from the New Orleans region after hurricane Katrina.
An example application of this type of search would be to determine
the number of houses affected by a natural disaster in an
autonomous manner. This can be achieved by comparing the number
of houses found before and after a disaster. Since satellite
images contain a lot of data, it is often difficult for humans to
quickly find places of interest in these images. In this task, the
model was set to find images containing houses, so that humans
can determine what to do with these regions (provide food,
estimate the disaster area, etc.). The system was trained with 38
instances of houses obtained from 10 such satellite images
(786 x 786 pixels), and was tested on finding 95 houses spread
out across 20 images. On average each image contained five
houses (standard deviation 1.94), each occupying about
50 x 50 pixels. All images were hand labeled and a house was
considered found if it was within a 30 pixel radius region of interest.
For comparison, the same search task was run with the optimal
gains proposed in Navalpakkam and Itti (2007).
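
The "found" criterion above can be sketched as a simple distance test. The interpretation used here is an assumption: a house counts as found when its hand-labeled position lies within 30 pixels of a fixation. The function name and the coordinates are hypothetical.

```python
import math

def shifts_to_first_hit(fixations, houses, radius=30.0):
    """Return the 1-based index of the first fixation that lands within
    `radius` pixels of any labeled house, or None if no fixation does."""
    for i, (fx, fy) in enumerate(fixations, start=1):
        if any(math.hypot(fx - hx, fy - hy) <= radius for hx, hy in houses):
            return i
    return None

# Hypothetical fixation sequence and labeled house positions (pixels).
fixations = [(100, 100), (400, 420), (700, 50)]
houses = [(410, 400), (90, 600)]
print(shifts_to_first_hit(fixations, houses))  # 2
```

Averaging this count over test images would give a statistic of the same kind as the mean number of searched locations reported in the results.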

Fig. 9 shows some of the training images used for the houses
while Fig. 11 shows a typical satellite image upon which our model
was used to find houses. To evaluate how well the Gaussian
distributions fit the underlying probability distributions, the feature
values were fit using a smoothing normal kernel function with a sliding
window. The results shown in Fig. 10 indicate that the distributions
do in fact resemble a Gaussian distribution. However, note that in
some cases the distributions are highly peaked, which suggests that
a super-Gaussian model may provide a slightly better fit.
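
A comparison of this kind can be sketched with a basic kernel density estimate. The code below is an assumption-laden stand-in for the smoothing procedure mentioned above: it uses a normal kernel with a hand-picked bandwidth, and the sample values are hypothetical rather than measured feature responses.

```python
import math
import statistics

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kernel_density(x, samples, bandwidth):
    """Average of normal kernels centered on each sample."""
    return sum(normal_pdf(x, s, bandwidth) for s in samples) / len(samples)

def max_abs_gap(samples, bandwidth, grid):
    """Largest pointwise difference between the kernel estimate and a
    single Gaussian fitted by sample mean and standard deviation."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return max(
        abs(kernel_density(x, samples, bandwidth) - normal_pdf(x, mu, sigma))
        for x in grid
    )

# Hypothetical feature values for one feature map.
samples = [0.20, 0.25, 0.30, 0.35, 0.40]
grid = [i / 1000.0 for i in range(-1000, 2001)]
print(max_abs_gap(samples, bandwidth=0.05, grid=grid))
```

A small gap suggests the unimodal Gaussian is an adequate summary for that feature map, whereas a large gap near the peak would be consistent with the highly peaked cases noted in the text.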

As can be seen in Fig. 11, not all attended locations fell within
houses, but the majority of locations did. On average it took
1.52 searched locations with a standard deviation of 2.95 to find
a house. The optimal gains method found a house within 1.95
searched locations on average with a standard deviation of 1.51.
Fig. 12 shows the percentage of the image that needed to be
searched in order to find the houses in all 20 images. These results
show that on average after searching about 25% of the image, all
houses were found.

Fig. 12 also shows that the optimal gains method performed
slightly better when finding the first few houses, but took much
longer than our method to discover the more difficult targets. This
slight improvement in initial performance is likely due to the fact
that the optimal gains model considers both the target’s and
distractors’ features in order to compute the best gain values. On the
other hand, the SalBayes method only uses knowledge from the object to
find the object. After finding a few houses, the performance of the
optimal gains model drops considerably. This is mainly due to the
max normalization method (see Section 2.1 and Navalpakkam & Itti,
2007 for details), which allows features which ‘pop out’ from the
scene, yet are unrelated to the targets, to compete with those targets
whose features are less visually unique.


4.  Discussion 

In this paper, we have developed a new unified model of
attentional guidance and recognition which exploits the duality
between these two tasks. On the one hand, when the model is
provided with a description of an object, it will output a probability
map describing the likelihood that the object can be found at each
location in an input image. On the other hand, when provided with
only a location in an input image, the model will provide a list of
probabilities denoting the likelihood that each of its known
objects is located at the given location. As shown in the results, the
model performs informed search better than previous related
efforts when given difficult targets, and has shown recognition
performance that is on par with current state-of-the-art methods
while providing very significant speed gains.

Fig. 9. Training images used to find houses in the satellite images. The system was trained with 38 instances of houses obtained from 10 satellite images 786 x 786 pixels in

Fig. 10. Probability distribution of the houses for the various feature maps using a smoothing normal kernel function with a sliding window. The features are broken down in
a grid where the rows indicate the feature type (intensity, color opponency (red-green, blue-yellow) and four orientations 0°, 45°, 90°, 135°), while the columns indicate the
scale (1 being the coarsest and 6 being the finest).

Fig. 11. Typical results for finding houses. The small yellow square indicates the
fixation point, while the yellow circle indicates the inhibition of return size. The
arrow shows the order in which the fixation points were chosen (which
corresponded to saliency values). As can be seen, not all attended locations fell
within houses, but the majority of locations did. (For interpretation of the
references to color in this figure legend, the reader is referred to the web version
of this article.)

Fig. 12. Percentage of houses found vs. percentage of image searched for the 20
satellite images. Red line indicates the baseline performance if we tried to find
houses at random. The green line indicates the performance achieved by SalBayes,
while the dashed-blue line indicates Optimal Gains (see Navalpakkam and Itti,
2007). (For interpretation of the references to color in this figure legend, the reader
is referred to the web version of this article.)

To our knowledge, no one has extensively tested their models
across the three popular datasets used here all together. The work
by Murthy (2007) comes close, but it used only a subset of the
ALOI dataset for recognition. As can be seen, other models have
been able to achieve superior performance on specific datasets.
However, it is important to note that all of the successful methods
use non-parametric methods for classification, which causes their
computation time to grow linearly with the number of
training views. Since the goal of the proposed model is to provide
a fast first layer recognition stage, any algorithm containing
complex, non-parametric classifications will not be able to efficiently
support a large object database. We propose that our model could
easily lend its vast speed improvements by operating as a fast
front-end to such complex algorithms, and leave the analysis of
such a hybrid system to future work.

Although the model was tested using an extensive dataset of
objects and scenes, additional tests using objects in natural scenes
could prove useful as well. However, there have not been any
datasets created thus far which contain many objects (on the order of
1000) under systematic variations embedded in natural scenes.
Image databases such as LabelMe (Russell, Torralba, Murphy, &
Freeman, 2005) and Caltech 101/102 (Fei-Fei, Fergus, & Perona,
2006) do not provide a systematic object search, but are more
concerned with general object search. For example, finding any chair
could be viewed as search, but requires much more semantic
knowledge for the search (there are many types of chair that could
exist). As a result, a broad semantic meaning can cause great
variations and ambiguity in the search. Additionally, most of the
objects that people have labeled in the LabelMe dataset are salient to
begin with and would not greatly benefit from a biased saliency map
(Elazary & Itti, 2008). We believe that one of the strong points of our
experimental validation in this paper is that it is very systematic,
which will be more difficult to achieve with these types of labeled
natural scenes.

Looking at the performance on the ALOI dataset against various
transformations, we see that the model does not perform as well as
HMAX under the illumination color condition. This is mostly because
the model uses color information to perform
classification. Therefore, as the color of the object changes (due
to the color of the light), the model encounters more ambiguity.
However, such changes in illumination color do not often occur
in the real world, and so we claim that robustness to 3D transformation
and illumination direction are more desirable features in a
first-level recognition system.

Looking at the timing of the methods tested, it can be
seen that the proposed method, SalBayes, outperforms both SIFT
and HMAX by a large factor. Furthermore, the time requirement
for both HMAX and SalBayes does not change significantly with
the size of the training dataset (both decrease as the amount of remaining
testing data decreases). This results from the underlying classifier that
is used to classify the features. Both SalBayes and HMAX use a
parametric density function to estimate the probability of the
features belonging to a particular class. SIFT, however, uses a
non-parametric estimation (k-NN), so the
time required to classify a given feature increases with the amount of
training data.
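
The scaling argument can be illustrated with a minimal sketch. It is not the SalBayes, HMAX or SIFT code: it assumes one-dimensional features, a single Gaussian per class, and a 1-nearest-neighbor search standing in for k-NN. All names and the toy data are hypothetical.

```python
import math

def fit_gaussians(samples_by_class):
    """Fit one 1-D Gaussian (mean, variance) per class from training samples."""
    params = {}
    for label, xs in samples_by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-9  # avoid zero variance
        params[label] = (mean, var)
    return params

def classify_gaussian(x, params):
    """Parametric: one log-density evaluation per class, whatever the training size."""
    def log_pdf(mean, var):
        return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
    return max(params, key=lambda label: log_pdf(*params[label]))

def classify_nearest_neighbor(x, samples_by_class):
    """Non-parametric (k = 1): one distance per stored training sample."""
    best_label, best_dist, n_distances = None, float("inf"), 0
    for label, xs in samples_by_class.items():
        for s in xs:
            n_distances += 1
            if abs(x - s) < best_dist:
                best_label, best_dist = label, abs(x - s)
    return best_label, n_distances

# Toy training data (hypothetical feature values for two classes).
train = {"A": [0.9, 1.0, 1.1, 1.2], "B": [2.9, 3.0, 3.1, 3.2]}
params = fit_gaussians(train)
label_parametric = classify_gaussian(1.05, params)
label_knn, distances_computed = classify_nearest_neighbor(1.05, train)
```

Doubling the training set doubles `distances_computed` for the nearest-neighbor classifier, while the Gaussian classifier still evaluates one density per class.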

While examining the performance of the proposed model, it
was found that additional training examples did not always
improve performance. This can be attributed to ambiguities introduced
by modeling each feature distribution as a unimodal
Gaussian. When too many training instances are used, the actual
distribution of a feature can become multi-modal,
which is then poorly approximated by the model. Future
work is planned to evaluate more advanced PDF representations,
such as mixtures of Gaussians or particle filters, to try to accommodate
such situations. However, despite these limitations, the
proposed model has shown that, from a very low-dimensional feature
vector (42 dimensions) at a single location on an object (the
most salient location), it was able to distinguish
among many objects.

^ We are currently in the process of building an extensive dataset where the same
objects are photographed in different complex natural backgrounds under different
light conditions and poses.

One improvement to the model could be made through the choice of
probability distribution. For example, after examining the features
of particular objects, it was found that the feature distribution
often could not be modeled using a single Gaussian.
That is, some of the variations of particular features could not be
explained with a normal distribution, in particular the color (under
the color illumination changes) and the orientations (under the
rotation variations). As a result, estimating these as normal distributions
would cause errors in biasing and classifying the features.
One explanation for the shape of these distributions is
the varying range of values for different objects of the same
class. For example, an object could contain strong red features
and weak red features depending on the illumination color.

It was also found that the distributions in the ALOI dataset often
exhibited two modes (which were primarily due to the changes in
orientation and illumination). If the various variations
of the objects can be modeled, then a single Gaussian can be used
to describe a particular part of an object, and the mixture can be
used to describe all the parts. Therefore, using a mixture of Gaussians
can provide a better model of the probability distribution.
Training the mixture of Gaussians can be achieved by using
an expectation-maximization (EM) algorithm. The drawbacks of
this algorithm, however, are that it is an iterative method and
requires that all training exemplars be available in each iteration. It
would be worth investigating how the mixture-of-Gaussians model
can be learned on-line as new inputs come in. One suggested way
would be to cluster the data, extract the means, and then learn a
single Gaussian on each cluster. The multiple clusters would then
yield the mixture model.
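
A minimal sketch of the suggested cluster-then-fit procedure follows. It assumes one-dimensional features, and it uses a plain batch k-means as a stand-in for whatever on-line clustering step would be used in practice. The function names and the toy data are hypothetical.

```python
import math

def kmeans_1d(xs, k, iters=20):
    """Plain 1-D k-means; centers start at evenly spaced order statistics."""
    xs_sorted = sorted(xs)
    centers = [xs_sorted[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

def mixture_from_clusters(clusters, n_total):
    """Learn one (weight, mean, variance) Gaussian component per non-empty cluster."""
    components = []
    for c in clusters:
        if not c:
            continue
        mean = sum(c) / len(c)
        var = sum((x - mean) ** 2 for x in c) / len(c) + 1e-6  # variance floor
        components.append((len(c) / n_total, mean, var))
    return components

def mixture_pdf(x, components):
    """Density of the resulting mixture of Gaussians at x."""
    return sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
               for w, m, v in components)

# Hypothetical bimodal feature values (e.g., one feature under two conditions).
values = [0.9, 1.0, 1.1, 1.0, 4.9, 5.0, 5.1, 5.0]
components = mixture_from_clusters(kmeans_1d(values, k=2), len(values))
```

Unlike EM, this sketch makes a hard assignment of each sample to one cluster, so the component parameters are only an approximation of a maximum-likelihood mixture fit.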

Fig. 10 shows that some of the distributions in the satellite
image house search could have been modeled using a super-Gaussian
to account for the sharp peak in the distribution. For
example, the Laplace or logistic distribution could have been used
for some of the distributions to model this peak. This
could improve performance not only by increasing the probability
around the mean but also by accounting for more variation through
a fatter tail. However, future research will need to determine
when and how to switch distribution models and how this will
affect performance for both search and recognition.
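
The peak and tail behavior mentioned here can be checked numerically. The sketch below compares a Gaussian and a Laplace density with the same mean and standard deviation. The chosen evaluation points (the mean and four standard deviations away) are arbitrary illustrations, not values from the experiments.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def laplace_pdf(x, mu, sigma):
    """Laplace density parameterized to have the same standard deviation sigma."""
    b = sigma / math.sqrt(2)  # the variance of Laplace(mu, b) is 2 * b**2
    return math.exp(-abs(x - mu) / b) / (2 * b)

# Ratio of Laplace to Gaussian density at the mean and far in the tail.
peak_ratio = laplace_pdf(0.0, 0.0, 1.0) / gaussian_pdf(0.0, 0.0, 1.0)
tail_ratio = laplace_pdf(4.0, 0.0, 1.0) / gaussian_pdf(4.0, 0.0, 1.0)
```

Both ratios exceed one: for equal variance, the Laplace density is higher at the mean (by a factor of the square root of pi) and also higher far from the mean, at the cost of lower density at intermediate distances.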

Examining the satellite image search results (Fig. 12), we see
that the Optimal Gains model proposed in Navalpakkam
and Itti (2007) performs the same as the proposed model for
the first few houses, but then loses performance when attempting
to find more houses. The reason is that Optimal Gains follows a
structure similar to that proposed by Treisman’s Feature Integration Theory
(Treisman & Gelade, 1980; Treisman & Sato, 1990) and Wolfe’s
Guided Search (Wolfe, 1994), in which whole spatial maps of feature
detectors are biased towards the target. Considering the neural
hardware available in the brain (each neuron can perform
computations independently of the others), it is conceivable
that each neuron can be biased separately, which is the approach
we have chosen to take in this paper. Additionally, we bias the feature
maps with a more probabilistic approach (applying a PDF
for each neuron) as opposed to a simple gain change. This
enables the system to bias for weak features among strong ones,
as discussed in the introduction (since applying a gain would boost
features and not suppress them). From a biological standpoint, this can
be seen as individually shaping the tuning curves of the detectors that are
more likely to respond to the target, moving them toward the target
using prior knowledge about the object. This
results in fine granularity in the discrimination ability of the search
without the added overhead and limitations of multiple sub-bands.
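
The difference between a gain change and PDF-based biasing can be made concrete with a toy example. The sketch below is not the model's implementation: it assumes a single feature, an unnormalized Gaussian likelihood as the bias, and made-up response values in which the target is the weakest responder.

```python
import math

def bias_with_gain(responses, gain):
    """Multiplicative gain: rescales every response, so the ranking is unchanged."""
    return [gain * r for r in responses]

def bias_with_pdf(responses, mean, sigma):
    """Unnormalized Gaussian likelihood of each response under the target statistics."""
    return [math.exp(-((r - mean) ** 2) / (2 * sigma ** 2)) for r in responses]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical feature responses at four locations; the target (index 1) is weak.
responses = [0.9, 0.3, 0.8, 0.95]
gain_winner = argmax(bias_with_gain(responses, gain=2.0))
pdf_winner = argmax(bias_with_pdf(responses, mean=0.3, sigma=0.1))
```

With any positive gain the strongest responder (index 3) still wins, whereas the likelihood-based bias selects the weak target (index 1) because its response matches the learned target mean.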

Additionally, it is important to note that the Optimal Gains system,
as well as Feature Integration Theory and Guided Search, is
trained for search specifically, and does not use this information
for classifying the object. In this paper, we concentrated on the
synergy between learning the parameters for the classification
and then using them for search. However, a hybrid system could
be used so that the object can be searched for more efficiently in the
presence of known distractors. In particular, using some of the
knowledge of the distractors could help achieve greater performance
under certain conditions.

Lastly, the model proposed in this paper works in situations in
which the object can be described using a few simple features. For
example, a house or a road can be described using simple features.
However, more complex objects or scenes would need multiple
features spanning a greater spatial distance (more than the fovea
size) to be described. For example, an urban area does not
contain a single house but multiple houses and roads (as
seen from above). As a result, it would be advantageous if more
knowledge could be added to the biasing, as proposed by Navalpakkam
and Itti (2005). This knowledge would describe the parts of
the object and their relations in the scene. For example, if the system
is looking for a refrigerator, then it knows that refrigerators are
composed of doors. In addition, if the system is looking at a kitchen
scene, then it can first check the likely locations of fridges within
the scene. Therefore, the knowledge of scenes can be used to
speed up the search in more complex scenes. This
knowledge can also boost the recognition rate by setting up the
appropriate prior for the scene. For example, if we know the probability
of a fridge appearing in a given scene, ambiguities in
appearance with another object (say a door) can be resolved
using the prior information about the scene. This knowledge can
be provided by gist models, such as the one proposed by Torralba,
Oliva, Castelhano, and Henderson (2006). In addition, the knowledge
base can be used to narrow down the search for features.
For example, if a few houses have already been encountered, then the system
should check for the presence of trees. Therefore, the next fixation
should bias for trees. As a result, the system knows that this
could not be an ocean (because of the structure in the knowledge
base), so it should not bias for boats. For a previous implementation
of such a system, see Navalpakkam and Itti (2005).
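
The role of a scene prior in resolving an ambiguous appearance can be sketched with Bayes' rule. The likelihood and prior values below are invented for illustration and are not taken from any experiment. The object and scene labels are likewise hypothetical.

```python
def posterior(likelihood, prior):
    """Bayes' rule over a discrete set of object hypotheses."""
    unnormalized = {obj: likelihood[obj] * prior[obj] for obj in likelihood}
    total = sum(unnormalized.values())
    return {obj: value / total for obj, value in unnormalized.items()}

# Appearance alone is ambiguous: the features fit a door slightly better.
appearance_likelihood = {"fridge": 0.45, "door": 0.55}

# Hypothetical scene priors, as might be supplied by a gist model.
kitchen_prior = {"fridge": 0.7, "door": 0.3}
hallway_prior = {"fridge": 0.05, "door": 0.95}

in_kitchen = posterior(appearance_likelihood, kitchen_prior)
in_hallway = posterior(appearance_likelihood, hallway_prior)
best_in_kitchen = max(in_kitchen, key=in_kitchen.get)
best_in_hallway = max(in_hallway, key=in_hallway.get)
```

The same ambiguous appearance is classified as a fridge under the kitchen prior and as a door under the hallway prior.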
Appendix A

A.1. HMAX

This visual structure was first proposed by Riesenhuber and
Poggio (1999) and later improved by Serre et al. (2005). It was
dubbed HMAX (“Hierarchical Model and X”) and draws its
inspiration from biological vision. The main contribution of the
structure is its ability to achieve invariance at the local level by
pooling local features using a max operator in both scale and position.
The whole structure is built from two layers, where the first
layer extracts Gabor features and pools them together. The pooling
first takes the max over position by sub-sampling the space
into a grid of size N-band and then takes the max between scales.
The second layer extracts codewords at random from the first layer
and stores them in a database. The response of the layer is then
computed by a distance measure between the memorized patches
and the current stimulus using a radial basis function (RBF). Lastly,
an SVM is used to classify objects based on the features from the
second layer.
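
The two pooling steps of the first layer can be sketched as follows. This is a simplified illustration rather than the HMAX reference code: it assumes the two response maps from adjacent scales already have the same size, uses non-overlapping pooling cells, and replaces Gabor responses with made-up integers.

```python
def max_pool_position(response, cell):
    """Max over non-overlapping cell x cell neighborhoods of a 2-D response map."""
    rows, cols = len(response), len(response[0])
    return [
        [max(response[r][c]
             for r in range(r0, min(r0 + cell, rows))
             for c in range(c0, min(c0 + cell, cols)))
         for c0 in range(0, cols, cell)]
        for r0 in range(0, rows, cell)
    ]

def max_pool_scales(map_a, map_b):
    """Element-wise max between two equally sized pooled maps from adjacent scales."""
    return [[max(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]

# Hypothetical filter responses at two adjacent scales.
scale_a = [[1, 2, 0, 0], [3, 4, 0, 1], [0, 0, 5, 6], [0, 0, 7, 8]]
scale_b = [[0, 0, 9, 0], [0, 0, 0, 0], [2, 0, 0, 0], [0, 0, 0, 1]]

pooled = max_pool_scales(max_pool_position(scale_a, 2), max_pool_position(scale_b, 2))
```

Shifting a strong response within a pooling cell, or moving it to the adjacent scale, leaves `pooled` unchanged, which is the source of the local invariance described here.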

A.2.  SIFT 

This algorithm was proposed by Lowe (2004) and is known
as SIFT, which stands for Scale-Invariant Feature Transform. The
algorithm first extracts keypoints by using local scale-space maxima
and minima of various Difference of Gaussian (DoG) operations
applied to the input image. This results in keypoints from
various locations and scales with high texture energy. From these
keypoints, a descriptor vector invariant to scale, translation, slight
3D rotations and intensity is created. This is achieved with a
128-dimensional vector indicating the gradient locations and orientations
using a histogram. The space is quantized into a 4 × 4 grid
while the orientations are quantized into eight orientations. These
descriptor vectors are stored in a database for classification.

During the classification stage, the same process described
above is used to extract various descriptor vectors from a new image,
while a Nearest Neighbor algorithm is used to find matches in
the database. Additionally, at least three closely matching keypoints
are required to match with an additional affine constraint (checked
with a Hough transform) in order for the object to be recognized.
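
The descriptor size and the matching rule can be sketched as follows. This is a toy illustration and not the SIFT implementation used in the experiments: it uses a brute-force nearest-neighbor search, a fixed distance threshold (`max_distance`, an assumption of this sketch), and it omits the affine consistency check performed with the Hough transform. The descriptors are synthetic.

```python
import math

# 4 x 4 spatial cells with 8 orientation bins each give the 128 dimensions.
DESCRIPTOR_DIM = 4 * 4 * 8

def nearest_neighbor(query, database):
    """Return (index, Euclidean distance) of the closest database descriptor."""
    best_index, best_dist = -1, float("inf")
    for index, descriptor in enumerate(database):
        dist = math.dist(query, descriptor)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist

def is_recognized(query_descriptors, database, max_distance, min_matches=3):
    """Require at least min_matches keypoints whose nearest neighbor is close enough."""
    matches = sum(1 for q in query_descriptors
                  if nearest_neighbor(q, database)[1] <= max_distance)
    return matches >= min_matches

# Synthetic database: five one-hot descriptors.
database = [[float(i == k) for i in range(DESCRIPTOR_DIM)] for k in range(5)]

def perturbed(descriptor):
    """Copy of a descriptor with a small change in one unused bin."""
    noisy = list(descriptor)
    noisy[100] += 0.05
    return noisy

queries = [perturbed(database[k]) for k in range(3)]
recognized = is_recognized(queries, database, max_distance=0.1)
```

With only two matching keypoints, `is_recognized` returns `False`, mirroring the requirement of at least three matches.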

A.3. SVM

Support vector machines (SVMs) are a method of supervised
classification and regression first proposed by Vladimir Vapnik in
1963 for linear separation. The hypothesis space of an SVM is a
set of hyperplanes, among which the SVM seeks the one with the largest
distance to the nearest training sample of any class; this distance is known
as the functional margin. To handle non-linear classification, SVMs
employ the kernel trick proposed by Boser et al. (1992), which first
maps the data, by means of a kernel, into a space where it can be linearly
separated, and then performs the linear separation. Common kernels include
polynomial, radial basis function and Gaussian functions.
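
The kernel trick can be illustrated with the degree-2 polynomial kernel in two dimensions, for which the implicit feature map is small enough to write out. The sketch below only checks the kernel identity. It does not train an SVM, and the sample points are arbitrary.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; its feature space is infinite-dimensional."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, y)))

x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, y)
explicit_value = dot(phi(x), phi(y))
```

The two values agree, so a linear separation in the three-dimensional space of `phi` can be computed using only kernel evaluations on the original two-dimensional data.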
A visual attention-based bit allocation strategy for video compression is proposed. Saliency-based attention
prediction  is  used  to  detect  interesting  regions  in  video.  From  the  top  salient  locations  from  the  computed 
saliency  map,  a  guidance  map  is  generated  to  guide  the  bit  allocation  strategy  through  a  new  constrained 
global  optimization  approach,  which  can  be  solved  in  a  closed  form  and  independently  of  video  frame 
content.  Fifty  video  sequences  (300  frames  each)  and  eye-tracking  data  from  14  subjects  were  collected  to 
evaluate  both  the  accuracy  of  the  attention  prediction  model  and  the  subjective  quality  of  the  encoded  video. 
Results  show  that  the  area  under  the  curve  of  the  guidance  map  is  0.773  ±  0.002,  significantly  above  chance 
(0.500).  Using  a  new  eye-tracking-weighted  PSNR  (EWPSNR)  measure  of  subjective  quality,  more  than  90% 
of  the  encoded  video  clips  with  the  proposed  method  achieve  better  subjective  quality  compared  to  standard 
encoding with matched bit rate. The improvement in EWPSNR exceeds 2 dB in the best case and is 0.79 dB on average.
1. Introduction

Significant improvements in video coding efficiency have been
achieved with modern hybrid video coding methods such as H.264/AVC
[1,2] in the last two decades. Spatial and temporal redundancy in
video sequences has been dramatically decreased by introducing
intensive spatial-temporal prediction, transform coding, and entropy
coding. However, further reducing this so-called objective redundancy
to achieve better compression performance offers limited gains and is
computationally very complex.

On  the  other  hand,  research  on  human  visual  characteristics  shows 
that  people  only  perceive  clearly  a  small  region  of  2-5°  of  visual  angle. 
The  human  retina  possesses  a  non-uniform  spatial  resolution  of 
photoreceptors,  with  highest  density  on  that  part  of  the  retina  aligned 
with  the  visual  axis  (the  fovea),  and  the  resolution  around  the  fovea 
decreases logarithmically with eccentricity [3]. Moreover, research
results show that observers' scanpaths are similar, and predictable to
some  extent  [3].  These  research  results  provide  a  new  pathway  to 
compress  images/videos  based  on  human  visual  characteristics:  only 
encode  a  small  number  of  well  selected  interesting  regions  (attention 
regions)  with  high  priority  to  keep  a  high  subjective  quality,  while 
treating  less  interesting  regions  with  low  priority  to  save  bits. 

Recently, many subjective quality-based video coding methods
have been developed. According to how the attention
regions are obtained, they can be coarsely classified into four categories, as follows:
(1) In the first approach, considering that human attention prediction
is still an open problem, human-machine interaction methods are
adopted to obtain the attention regions. One example of online
human-machine interactive methods is gaze-contingent video transmission,
which uses an eye-tracking device to record eye position
from a human observer on the receiving end and applies in real time a
foveation filter to the video contents at the source [4-8]. This
approach is particularly effective because observers usually do not
notice any degradation of the received frames, since high-quality
encoding continuously follows the high-acuity central region of the
observers' foveas. However, this application is restricted to specific
cases where an eye-tracking apparatus is available at the receiving
end. For general-purpose video compression, this approach faces
severe limitations if an eye-tracker is not available or if several viewers
watch a video stream simultaneously. To address this, offline
interactive methods are designed to obtain the interesting regions by
asking subjects to manually draw regions which are interesting, and
then applying this to the encoding procedure [9].

(2) The second class
of approaches uses machine vision algorithms to automatically detect
interesting regions. For instance, due to the importance of human
faces when people perceive the world [10,11], it is reasonable to
consider that human faces are likely to constitute interesting regions. In
[12-14], face regions are thus defined as the regions-of-interest. Face
detection and tracking methods are explored to keep the interesting
regions focused onto human faces, and more resources are allocated
during encoding to these face regions, to keep these regions in high
quality. With the development of face detection algorithms and object
tracking methods in machine vision, this kind of video compression is
very effective in cases where human faces indeed are central
to the visual understanding of a video sequence, such as for video
telephony or video conferencing. However, this type of approach is
obviously only workable when human faces are present. For
unconstrained video compression where there may or may not be
faces in the streams to be encoded, this method will fail to find
interesting regions.

(3) The third class of approaches uses knowledge
about human psychophysics to guide the encoding process. For
example, research results show that the human visual system (HVS)
can tolerate certain amounts of noise (distortion) depending on its
sensitivity to the source and type of noise for a given region in a given
frame. Under certain conditions, the HVS can tolerate more distortion
than objective distortion measurements such as mean square
error (MSE) would predict; on the other hand, there are some types of
distortions which, despite low MSE, are vividly perceived and impair
the viewing experience [15-17]. Based on this theory, many image/video
encoding techniques have sought to optimize perceptual rather
than objective (MSE) quality: these techniques allocate more bits to
the image areas where humans can easily see coding distortions, and
allocate fewer bits to the areas where coding distortions are less
noticeable. Experimental subjective quality assessment results show
that visual artifacts can be reduced through this approach; however,
there are two problems: one is that the mechanisms of human
perceptual sensitivity are still not fully understood, especially as
captured by computational models; the other is that perceptual
sensitivity may not necessarily explain people's attention. For
example, smoothly textured regions and objects with regular motions
often belong to the background of a scene and do not necessarily catch
people's attention, but these types of regions are highly perceptually
sensitive if attended to.

(4) The fourth class of approaches exploits
recent computational neuroscience models to predict which regions
in video streams are more likely to attract human attention and to be
gazed at. With the development of brain and human vision science,
progress has been made in understanding visual selective attention in
a biologically plausible way, and several computational attention
models have been proposed [18-20]. In these models, low-level
features such as orientation, intensity, motion, etc. are first extracted,
and then, through a nonlinear biologically inspired combination of these
features, an attention map (usually called a saliency map) can be
generated. In this map, the interesting locations are highlighted and
the intensity value of the map represents the attention importance.
Under the guidance of the attention map, resources can be allocated
non-uniformly to improve the subjective quality or save
bandwidth [21-24]. Although such research shows promising results,
it is still not a completely resolved problem.

Once  interesting  regions  are  extracted,  a  number  of  strategies  have 
been  proposed  to  modulate  compression  and  encoding  quality  of 
interesting  vs.  uninteresting  regions  [21,25-29].  One  straightforward 
approach  is  to  reduce  the  information  in  the  input  frames.  In  [4,21,22], 
the  frames  to  be  encoded  are  first  blurred  (foveated)  according  to  the 
attention  map.  The  foveated  image  only  keeps  the  attention  regions  in 
high  quality  while  the  other  regions  are  all  blurred.  Through  the 
blurring,  redundancy  is  reduced  significantly,  and  the  compression 
ratio  can  be  several  times  higher  than  the  normal  encoding  method. 
However,  blurring  yields  obvious  degradation  of  subjective  quality  in 
the  low  saliency  regions.  In  [23],  a  bit  allocation  scheme  through 
tuning  the  quantization  parameter  is  proposed  with  a  constrained 
global  optimization  approach.  Results  show  that  60%  of  the  test  video 
sequences  encoded  by  this  approach  have  better  subjective  visual 
quality  compared  to  the  video  encoded  by  the  normal  method  under 
the same bandwidth. In rate-distortion optimization, different coding modes
may yield different video quality and bit rate. The mode decision is
usually made by minimizing a cost function that is the sum of the
encoding error and the bit rate multiplied by a parameter (called the Lagrange
multiplier). Considering that the Lagrange multiplier will affect the
mode decision in rate-distortion optimization, a Lagrange multiplier
adjustment method is explored in [25]. An optimized rate control
algorithm with foveated video is proposed in [26], and foveal peak
signal-to-noise ratio (FPSNR) is introduced as a subjective quality
assessment. In [28], a region-of-interest based resource allocation
method  is  proposed,  in  which  the  quantization  parameter,  mode 
decision,  number  of  referenced  frames,  accuracy  of  motion  vectors, 
and  search  range  of  motion  estimation  are  adaptively  adjusted  at  the 
macroblock  (MB)  level  according  to  the  relative  importance  (obtained 
from  the  attention  map)  of  each  MB. 
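
The Lagrangian mode decision described in this paragraph can be sketched as follows. The cost has the form J = D + λR, where D is the encoding error, R the bit rate and λ the Lagrange multiplier. The mode names and the distortion and rate figures are hypothetical and are not taken from any encoder.

```python
def rd_cost(distortion, rate_bits, lagrange_multiplier):
    """Rate-distortion cost J = D + lambda * R."""
    return distortion + lagrange_multiplier * rate_bits

def choose_mode(candidates, lagrange_multiplier):
    """Pick the candidate mode with the smallest rate-distortion cost."""
    return min(candidates,
               key=lambda name: rd_cost(*candidates[name], lagrange_multiplier))

# Hypothetical (distortion, rate in bits) pairs for one macroblock.
candidates = {
    "intra_16x16": (40.0, 120),
    "inter_16x16": (55.0, 60),
    "skip": (90.0, 2),
}

mode_small_lambda = choose_mode(candidates, 0.1)
mode_medium_lambda = choose_mode(candidates, 0.5)
mode_large_lambda = choose_mode(candidates, 2.0)
```

A small multiplier favors the low-distortion mode and a large one favors the cheapest mode, which is why adjusting the multiplier per region shifts bits between regions.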

How to evaluate the quality of a compressed image/video is still an
open problem. Many quality assessment metrics have been developed
to evaluate the objective or subjective quality of video. Among them,
MSE and PSNR are two widely adopted objective quality measurements,
even though they often are not consistent with human
perception. Many additional types of objective (including human
vision-based objective) quality assessment methods have been
proposed [26,30-32]. However, the research results of the Video
Quality Experts Group (VQEG) show that there is no objective
measurement which can reflect the subjective quality in all conditions
[33]. The subjective quality suggested by VQEG is obtained
by using the mean opinion score (MOS) from a pool of human subjects.
Specifically, subjective quality ratings on a scale of excellent, good,
fair, poor and bad (with weights 5, 4, 3, 2, and 1, respectively) can
be obtained from naive observers, and the weighted mean (the MOS)
can be used as the subjective quality.
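
As a small worked example of the MOS computation, the sketch below averages the weights of a set of ratings. The five ratings are invented for illustration.

```python
# Weights of the five-level quality scale described in the text.
SCALE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def mean_opinion_score(ratings):
    """Mean of the scale weights over all observers' ratings."""
    return sum(SCALE[r] for r in ratings) / len(ratings)

# Hypothetical ratings from five observers for one video clip.
ratings = ["good", "excellent", "fair", "good", "poor"]
mos = mean_opinion_score(ratings)  # (4 + 5 + 3 + 4 + 2) / 5
```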

In this paper, we use a neurobiological model of visual attention,
which automatically selects (predicts) high saliency regions in
unconstrained input frames to generate a saliency map (SM).
Considering the foveated nature of the human retina, a guidance
map (GM) is generated by finding the top salient locations in the
saliency map. The GM is then used to guide the bit allocation in video
coding through tuning the quantization parameters in a constrained
optimization method. An overview of the proposed method is
shown in Fig. 1. For experimental validation, 50 high-definition
(1920 × 1080) video sequences were captured using a raw uncompressed
video camera; they include scenes at a library, pool, road
traffic, gardens, a dining hall, lab rooms, etc. Instead of using a
subjective rating method, an eye-tracking experiment which records
human subjects' eye fixation positions over the video frames was
conducted to validate both the attention prediction model and the
subjective quality of the compressed video. The focus of this paper is to
combine the attention model with the latest video compression
framework, and to validate the result in a quantitative way through an
eye-tracking approach. The experimental results show that the proposed
method is effective in both predicting human attention regions and
improving subjective video quality while keeping the same bit rate.

The present paper complements our previous work [21], in which
we showed that a saliency map model can predict human gaze well
above chance, and can be used to guide video compression through
selective blurring of low-salience image regions. The key innovation
in the present work is to replace the selective blurring step, which
yields quite obvious distortions in low-salience video regions, with a
more sophisticated and more subtle localized modulation of the
H.264 encoding parameters. Our new algorithm employs a constrained
global optimization approach to derive the encoding
parameters at every location in every video frame. We find that
the optimization can be solved in closed form, which gives rise to an
efficient implementation. This new optimization approach is an
important step as it yields encoded videos that subjectively look very
natural and are not degraded by blurring. Further, we develop and
test a new eye-tracking weighted PSNR (EWPSNR) measure of
subjective quality. Using this measure, we find that videos
compressed with the proposed technique have better EWPSNR on
our test video clips. Because our proposed method is purely
algorithmic, requires no human intervention or parameter tuning,
is applicable to a wide variety of video scenes, and yields improved
EWPSNR, we suggest that it could be integrated into future generations
of general-purpose video codecs.
Fig. 1. Overview of the model. For the attention model path, the current input frame is first decomposed into a multi-scale analysis with channels sensitive to low-level visual features
(two color contrasts: blue-yellow, red-green; temporal flicker; intensity contrast; four orientations, 0°, 45°, 90°, and 135°; and four directional motion energies, up, right, down, and
left). The saliency map is obtained after within-channel within-scale and cross-scale nonlinear competition. Assuming that the top salient locations in the saliency map are likely to
attract attention and gaze of viewers, a guidance map is generated by foveating these positions. On the compression path, the current macroblocks (MBs) are predicted from previously
encoded MBs through intra mode (the prediction is generated from the current frame) or inter mode (the prediction is generated from the
previous frame). The prediction error (difference) is then passed through transform and quantization; here the generated guidance map is used to adjust the quantization
parameters to realize the non-uniform bit allocation. An encoded frame is complete after quantization and entropy encoding.


2.  Method 

2.1. Attention model

The  model  computes  a  topographic  saliency  map  which  indicates 
how  conspicuous  every  location  in  the  input  image  is.  We  used  the 
freely  available  implementation  of  the  Itti-Koch  saliency  model 
[34].  In  this  model,  an  image  is  analyzed  along  multiple  low-level 
feature  channels  to  give  rise  to  multi-scale  feature  maps,  which 
detect  potentially  interesting  local  spatial  discontinuities  using 
simulated center-surround neurons. Twelve feature channels are
used to simulate neural features that are sensitive to color
contrasts (red/green and blue/yellow), temporal intensity flicker,
intensity contrast, four orientations (0°, 45°, 90°, and 135°), and four
oriented motion energies (up, down, left, and right). The particular
low-level features extracted here have been shown to attract
attention in humans and monkeys, as previously investigated
in detail [19,35,36]. Center-surround scales are obtained
from dyadic pyramids with 9 scales, from scale 0 (the original
image) to scale 8 (the image reduced by a factor of $2^8 = 256$ in both
the horizontal and vertical dimensions). Six center-surround
difference maps are then computed as point-to-point differences
across pyramid scales, for each combination of three center scales
($c = \{2,3,4\}$) and two center-surround scale differences
($\delta = \{3,4\}$). Thus, six feature maps are computed for each of the
12 features, yielding a total of 72 feature maps. Each feature map is
additionally  endowed  with  internal  dynamics  that  provide  a  strong 
spatial  within-feature  and  within-scale  competition  for  activity, 
followed  by  within-feature,  across-scale  competition  [37].  In  this 
way,  initially  possibly  very  noisy  feature  maps  are  reduced  to  sparse 
representations  of  only  those  locations  which  strongly  stand  out 
from  their  surroundings.  All  feature  maps  finally  contribute  to  the 
unique  scalar  saliency  map,  which  represents  visual  conspicuity  of 
each  location  in  the  visual  field. 
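
As an illustration, the scale bookkeeping described above can be sketched in a few lines of Python. This is only a minimal sketch and not the implementation of [34]; the channel labels and the names `FEATURES`, `center_surround_pairs` and `reduction_factor` are hypothetical and chosen for this example.

```python
from itertools import product

# Channel labels follow the text; the flat list layout is an illustrative choice.
FEATURES = (
    ["red-green", "blue-yellow", "flicker", "intensity"]
    + [f"orientation-{angle}" for angle in (0, 45, 90, 135)]
    + [f"motion-{direction}" for direction in ("up", "down", "left", "right")]
)
CENTER_SCALES = (2, 3, 4)     # c
SCALE_DIFFERENCES = (3, 4)    # delta; the surround scale is c + delta
NUM_SCALES = 9                # pyramid scales 0..8


def center_surround_pairs():
    """Return every (center, surround) scale pair used for one feature."""
    return [(c, c + d) for c, d in product(CENTER_SCALES, SCALE_DIFFERENCES)]


def reduction_factor(scale):
    """Linear size reduction of a dyadic pyramid scale relative to scale 0."""
    return 2 ** scale


pairs = center_surround_pairs()
feature_maps = [(name, c, s) for name in FEATURES for c, s in pairs]

print(pairs)
print(len(FEATURES), len(feature_maps), reduction_factor(NUM_SCALES - 1))
```

The enumeration gives six (center, surround) pairs per feature, the deepest surround scale being scale 8, and 12 × 6 = 72 feature maps in total, in agreement with the counts given in the text.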

After the saliency map is computed, a small number of discrete
virtual foveas endowed with mass/spring/friction dynamics attempt
to track a collection of the most salient objects, using proximity as well
as feature similarity to establish associations between n salient
locations and p fovea centers (similar to the approach described in
our previous work [21,22]). The association is established through
an exhaustive scoring of all n × p possible pairings between a new
salient location $X_t(i) = (x_t(i), y_t(i))$, $i \in \{1, \ldots, n\}$, and an old foveation
center $X_t(j) = (x_t(j), y_t(j))$, $j \in \{1, \ldots, p\}$, at time t. (Typically, p is fixed
and n = p + 4 to ensure robustness against varying saliency ordering
from frame to frame; p = 10 in the present implementation.) Four
criteria are used to determine the correspondence: (1) the Euclidean
spatial distance between the locations of i and j; (2) the Euclidean
distance between feature vectors extracted at the locations of i and j,
which coarsely capture the visual appearance of each of the two
locations; (3) a penalty term |i − j| that discourages permuting
previous pairings by encouraging a fixed ordered pairing; and (4)
a tracking priority that increases with salience, enforcing strong
tracking of only very salient objects. Combining these criteria tends
to assign the most salient object to the first fovea, the second most
salient object to the second fovea, etc. Video compression priority at
every location is then related to the distance to the closest fovea
center (using a 2D chamfer distance transform). For implementation
details, please refer to [21] and [34]. It is important to note that the
dynamics of the virtual foveas do not attempt to emulate human
saccadic eye movements but track salient objects in a smooth and
damped manner. The adopted correspondence and tracking algorithm
compromises between reliably tracking the few most salient
objects and time-sharing the remaining foveas among a larger set of less
salient objects.
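
To make the scoring step more concrete, the following Python sketch scores all n × p pairings with the four criteria and then assigns foveas greedily. It is a sketch under stated assumptions and not the algorithm of [21] or [34]: the unit weights, the greedy assignment rule, and the names `Site`, `pairing_cost` and `associate` are all hypothetical, since the text does not specify how the four terms are combined.

```python
import math
from typing import NamedTuple


class Site(NamedTuple):
    """A salient location or a fovea center (hypothetical container)."""
    xy: tuple
    features: tuple
    salience: float = 0.0


def pairing_cost(i, j, location, fovea, w_space=1.0, w_feat=1.0, w_order=1.0, w_sal=1.0):
    """Lower is better. The four terms mirror criteria (1)-(4); the weights are assumed."""
    spatial = math.dist(location.xy, fovea.xy)                  # criterion (1)
    appearance = math.dist(location.features, fovea.features)   # criterion (2)
    order_penalty = abs(i - j)                                  # criterion (3)
    priority = location.salience                                # criterion (4)
    return w_space * spatial + w_feat * appearance + w_order * order_penalty - w_sal * priority


def associate(locations, foveas):
    """Score all n x p pairings, then assign greedily starting from the cheapest pairing."""
    scored = sorted(
        (pairing_cost(i, j, loc, fov), i, j)
        for i, loc in enumerate(locations)
        for j, fov in enumerate(foveas)
    )
    assignment = {}   # fovea index -> salient-location index
    taken = set()
    for _cost, i, j in scored:
        if j not in assignment and i not in taken:
            assignment[j] = i
            taken.add(i)
    return assignment


locations = [
    Site((0.0, 0.0), (1.0,), salience=1.0),
    Site((10.0, 0.0), (0.0,), salience=0.5),
    Site((50.0, 50.0), (0.5,), salience=0.1),
]
foveas = [Site((1.0, 0.0), (1.0,)), Site((9.0, 0.0), (0.0,))]
print(associate(locations, foveas))
```

In this toy case each fovea keeps the nearby, similar-looking location, and the third, weakly salient location is left unassigned, which is the behavior the four criteria are meant to encourage.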

2.2.  Bit  allocation  strategy 

Assume that the rate-distortion (R-D) function is as follows [23,38]
for a given region (typically, a macroblock) i in an image:

$D_i(R_i) = \sigma_i^2 \, e^{-\gamma R_i}$ (1)

in which $D_i$ denotes the mean square error, $R_i$ stands for the bit rate,
$\sigma_i^2$ is a measure of the variance of the encoding signal (both
spatial and temporal) and describes the complexity of the video
content, and $\gamma$ is a constant coefficient. This approach assumes that the
distortion is computed in a uniform manner, i.e., distortion in
different areas of an image is equally important. However, if we take
the spatial resolution of human vision into consideration, then the
encoding distortion in area i can be written as follows:

$D'_i = w_i D_i$ (2)

where $w_i$ is the weight coefficient, which represents the spatial
resolution of human vision in area i. In the area around the center of gaze (the fovea),
$w_i$ should be higher than in areas far from the gaze position, because
even small distortions in the foveal region can be noticed,
while in peripheral regions relatively larger distortions
may go unnoticed.

According to this non-uniform distribution of human eye resolution,
there are two ways to optimize the bit allocation in encoding.
One is as proposed in [23]: keep the total bit rate constant
and maximize the subjective quality. Under the hypothesis that the $\sigma_i^2$ are
equal at every location, the conclusion is then that the quantization
parameter $QP_i$ is inverse-exponentially related to the optimized bit rate:

[Eq. (3), giving the optimized bit rate $R_i$ for $i = 1, 2, \ldots, N$, is illegible in the source] (3)

[Eq. (4), giving $QP_i$, is illegible in the source] (4)

where $S$ is the area of the entire frame, $S_i$ is the area of region i, N is
the number of regions in one frame, $\alpha$ accounts for overhead bits, and $\beta$ is
an adjustment parameter. However, in reality, the $\sigma_i^2$ differ considerably
over space within one frame, and it is hard to determine the parameters
correctly. In [23], the parameters are calculated from training
videos that are similar to those to be encoded by the system, which is
time consuming and may be unreliable if the test set differs
substantially from the training set. To avoid this, we take a different
approach: preserve the subjective quality while minimizing the bit
rate. With this method, we find that the optimized distortion
distribution is independent of the video frame contents and
depends only on the weight coefficients. The details of this new method are
as follows.

To minimize the bit rate while keeping the subjective quality the
same, we can write the global optimization problem as follows:

$\min \sum_i S_i R_i / S \quad \text{subject to} \quad \sum_i w_i D_i / W = D$ (5)

where $W$ is the sum of the weight coefficients over all areas and $D$
is the target distortion. With the Lagrange multiplier method we can
solve this problem in closed form:

$f(D_1, D_2, \ldots, D_N) = \sum_i S_i R_i / S + \lambda \left( \sum_i w_i D_i / W - D \right), \qquad R_i = \frac{1}{\gamma} \left( \log \sigma_i^2 - \log D_i \right)$ (6)

in which N is the number of areas in the encoded image. To obtain the
minimum value, we pose that, at the minimum:

$\frac{\partial f}{\partial D_1} = \frac{\partial f}{\partial D_2} = \cdots = \frac{\partial f}{\partial D_N} = 0$ (7)

Solving these equations, we obtain:

$D_i = \frac{W S_i}{w_i S} D \quad (i = 1, 2, \ldots, N)$ (8)

(Each condition in Eq. (7) reads $-S_i / (\gamma S D_i) + \lambda w_i / W = 0$, and the
constraint in Eq. (5), with $\sum_i S_i = S$, fixes $1 / (\gamma \lambda) = D$.)

We can see from this equation that the optimum $D_i$ is independent
of the video-content-related parameters $\gamma$ and $\sigma_i^2$. Hence the
optimization can be applied to any video regardless of its
content, and we need not train on any sample videos to
compute the optimum parameters. Furthermore, from Eq. (8) we
can see that the optimized distortion should be inversely proportional
to the weight coefficient. One special case is when all weights are
equal over all locations, in which case the distortion should be equally
distributed.

Now we can determine the bit allocation strategy from the
calculated distortion distribution. In mainstream video compression
schemes developed so far, the distortion stems only from quantization.
The basic quantization operation is as follows:

$Y = \mathrm{round}(X / Q_{step})$ (9)

where $X$ is usually the coefficient after the transform (DCT, DWT, etc.),
$Q_{step}$ is the quantization step, and $Y$ is the quantized result. The
Qstep-distortion mapping is linear and we can simply write the
Qstep-distortion (Q-D) model as follows:

$D = k \cdot Q_{step}$ (10)

where k is a constant coefficient related to the video content. Then the
optimized distortion for each area, transformed to the optimized
quantization step, is, at every location i:

$Q_{step,i} = \frac{W S_i}{w_i S} Q_{step}$ (11)

The formula above shows that in video coding based on human visual
characteristics, the quantization step should be inversely
proportional to the subjective quality weight coefficient.

According to the analysis above, we can apply the guidance map (GM) to guide the
quantization parameter adjustment and carry out the optimized bit
allocation. The GM values can be seen as the subjective weight
coefficients, and the quantization parameters can then be computed
from the formula above.

3. Video acquisition

Fifty video clips (1920 × 1080) were collected for this experiment
and each of them was cut to 300 frames (Fig. 2). All these clips were
captured by a Silicon Imaging SI-1920HD camera with an EPIX E4
frame grabber card at a frame rate of 30 ± 0.2 fps. The original frames
were captured in Bayer format without any compression and saved in
round-robin fashion onto 4 separate hard drives to avoid limitations in frame
rate due to limited disk bandwidth. The clips were captured around
the USC campus and include both outdoor and indoor scenes in
daytime. The outdoor scenes include a library, pool, road traffic,
gardens, museum, park, gates, fountains, square, lawn, and track & field,
and the indoor scenes include a dining hall, lab rooms, etc. After video
capture, all the frames were converted to RGB color images through
linear interpolation [39] and enhanced by gamma correction. Finally,
the frames were assembled into video clips in the YUV (4:2:0) format
for further processing.

To facilitate future research, this raw uncompressed video dataset,
as well as all the eye-tracking data described below, are made freely
available on the Internet (http://iLab.usc.edu/vagba/). We hope that
this comprehensive dataset and the associated human eye movements
can benefit a number of research projects aiming at improving
video compression quality, and beyond.

4. Human eye-tracking

The collected 50 uncompressed YUV-format video clips were
presented to 14 subjects and their eye fixation points were recorded
over the frames of each clip by an eye-tracker. The recorded
eye traces represent the subjects' shifting overt attention; thus the
eye-tracking data are suitable for validating the performance of the
attention prediction model and the subjective visual quality.

Fig. 2. Examples of captured frames. First row: dance01, seagull01, and garden09; second row: road02, fountain01, and robarm01; third row: park01, gate03, and lot01; fourth row:
fountain05, room02, and field06.

Subjects were naive to the purpose of the experiment and had
never seen these video clips before. They were also naive to attention
theory, saliency theory, and video compression theory. They were USC
students and staff (7 males, 7 females, mixed ethnicities, ages 22-32,
normal or corrected-to-normal vision). They were instructed to watch
the video clips without any specific task, and to attempt to follow
whatever interesting things they might like. Later they were asked
some general questions about what they had watched. The motivation
for these instructions was to make the experiment similar to
ordinary video watching. We believe that our instructions did not bias
subjects toward low-level salient frame locations as defined by the
Itti-Koch model [18].

Fig. 3. Eye fixation distribution examples. The maps are histogrammed over 10 × 10 image tiles and normalized to 1. (a) The overall distribution for all subjects and clips shows a bias
toward the center of the display. (b) Eye fixation distribution from one of the subjects over all the clips.

Stimuli were presented on a Sony Bravia XBR-III 46" 60 Hz 1080p
LCD-TV display connected to a Linux computer. Subjects were seated
on an adjustable chair at a viewing distance of 97.8 cm, which
corresponded to a field of view of 54.8° × 32.7°, and rested on a chin-rest. A
nine-point eye-tracker calibration was performed every ten clips.
Each calibration point consisted of fixating first a central cross, then a
blinking dot at a random point on a 3 × 3 matrix covering the screen
area. For each clip, subjects first fixated a central cross and pressed a key
to start, at which point the eye-tracker was triggered, the cross
blinked for 1066 ms, and the clip started. After each clip, the display
became grey and the eye-tracker was disabled. The experiment was
self-paced: the next clip was shown when the subject pressed the
space bar. Every ten clips, subjects could stretch before the nine-point
calibration. Stimuli were presented on a Linux computer, under
SCHED_FIFO scheduling (the process would keep 100% of the CPU as long
as needed) to guarantee timing. Each uncompressed clip
(1920 × 1080, YUV 4:2:0 format) was entirely preloaded into memory.
Frame  displays  were  hardware-locked  to  the  vertical  retrace  of  the 
monitor  (one  movie  frame  was  shown  for  two  screen  retraces, 
yielding  a  playback  rate  of  30.00  fps).  Microsecond-accurate  time- 
stamps  were  stored  in  memory  as  each  frame  was  presented,  and  later 
saved  to  disk  to  check  for  dropped  frames.  No  frame  drop  ever 
occurred  and  all  timestamps  were  spaced  by  33.333  ±  0.001  ms.  Eye 
position  was  tracked  using  a  240-Hz  infrared-video-based  eye-tracker 
(ISCAN,  Inc.,  model  RK-464).  Methods  were  similar  to  previously 
described  [21].  In  brief,  this  machine  estimates  point  of  regard  (POR) 
in  real-time  from  comparative  tracking  of  both  the  center  of  the  pupil 
and  the  specular  reflection  of  the  infrared  light  source  on  the  cornea. 
This  technique  renders  POR  measurements  immune  to  small  head 
translations  (tested  up  to  10  mm  in  our  laboratory).  All  analysis  was 
performed  offline.  Linearity  of  the  machine's  POR-to-stimulus 
coordinate  mapping  was  excellent,  as  previously  tested  using  a  7  x  5 
calibration  matrix  in  our  laboratory,  justifying  a  3x3  here.  The  eye- 
tracker  calibration  traces  were  filtered  for  blinks  and  segmented  into 
two  fixation  periods  (the  central  cross,  then  the  flashing  point),  or 
discarded  if  that  segmentation  failed  a  number  of  quality  control 
criteria.  An  affine  POR-to-stimulus  transform  was  computed  in  the 
least-square  sense,  outlier  calibration  points  were  eliminated,  and  the 
affine  transform  was  recomputed.  If  fewer  than  six  points  remained 
after  outlier  elimination,  recordings  were  discarded  until  the  next 
calibration.  Otherwise,  a  thin-plate-spline  nonlinear  warping  was 
then  applied  to  account  for  any  small  residual  nonlinearity.  Data  was 
discarded  until  the  next  calibration  if  residual  errors  greater  than 
34  pixels  (about  1°  field  of  view)  on  any  calibration  point  or  17  pixels 
(about  0.5°  field  of  view)  overall  remained.  Eye  traces  for  the  ten  clips 
following  a  calibration  were  remapped  to  screen  coordinates,  or 
discarded  if  they  failed  some  quality  control  criteria  (excessive  eye- 
blinks,  motion,  eye  wetting,  or  squinting).  Calibrated  eye  traces  were 
visually  inspected  when  superimposed  with  the  clips. 


Fig. 5. Ordinal dominance analysis; there are 1,455,279 fixation points in total.
(a) Histogram of guidance map values at eye positions and random locations (axis label: relative model value).
(b) Ordinal dominance curve (axis label: random hit (%)); the dashed line is the chance level.


Eye fixation distribution examples can be seen in Figs. 3 and 4.
We can see from Fig. 3 that the overall distribution of eye fixations is
quite strongly center-biased on average; however, for clips with different
content, the eye fixation distributions differ markedly and are
not necessarily center-biased (Fig. 4). This is important as it
suggests that a simplistic saliency map, which would simply
mark central screen regions as more salient, may work well on
average but not necessarily for individual video clips (see [21] for
further discussion). In our results below we report not only average
performance but also performance on individual clips and worst-case
performance.

Fig. 4. Examples of eye fixation distribution maps for different clips. The maps are histogrammed over 16 × 16 image tiles and normalized to 1. (a) gate03, (b) park01, (c) room02,
(d) seagull01.

5.  Experiment  result 

Uncompressed videos were shown to the subjects, and the eye-tracking
data were used to validate both the attention prediction
model and the attention-based bit allocation scheme.

Considering that a good attention prediction model should
output a model map that highlights the eye's fixation points,
we evaluated performance by quantifying the differences between
guidance map values at subjects' gaze targets and at randomly
selected locations, using ordinal dominance analysis [40]. Model map
values at subjects' fixation points and at randomly selected
locations were first normalized by the maximum value in the map
at the time the eye fixation occurred (100 random locations are
selected in this paper). Then, histograms of the values at eye positions
and at random locations were created. Fig. 5(a) shows the
histograms over all video clip types and subjects. It is easy to
see from the figure that many more human fixations landed on high
model salience values than expected by chance, which means the
proposed model performs better than a random model at
predicting human gaze. The mean guidance map value at observers'
fixation points is 0.778 and the median is 0.922, compared to
a mean of 0.435 and a median of 0.400 obtained with random fixations
(both model values are significantly higher than the random values,
$p < 10^{-10}$, using t-tests for the mean comparison
and sign tests for the median comparison). Fig. 6 shows
examples of frames and their corresponding guidance maps. The
subjects' eye fixation points are marked as small color patches in
the frames.

To further measure the difference between the observer and
random histograms, a threshold was decremented from 1 to 0, and
at each threshold the percentages of eye positions and of random
positions that fell on a map value larger than the threshold
("hits") were computed. An ordinal dominance curve (similar to a
receiver operating characteristic curve) was created by plotting
"observer hits" versus "random hits". The curve summarizes how well a
binary decision rule based on thresholding the map values could
discriminate signal (map values at observer eye positions) from
noise (map values at random locations). The overall performance
can be summarized by the area under the curve (AUC). An AUC
of 0.5 corresponds to a model that is at chance in predicting
human gaze, while larger AUC values indicate better prediction
performance. In our experiment, the ordinal dominance curve is
plotted in Fig. 5(b), and the AUC value is 0.773 ± 0.002. As an upper
bound, inter-observer correlations among humans yield an AUC
of 0.854 ± 0.001.
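
The threshold sweep behind the ordinal dominance curve can be sketched as follows. This is a minimal illustration rather than the analysis code used for Fig. 5: the number of threshold steps, the explicit closing point (1, 1), the trapezoidal integration, and the function names are assumptions made for this example.

```python
def ordinal_dominance_curve(observer_values, random_values, steps=100):
    """Lower a threshold from 1 to 0 and record (random-hit rate, observer-hit rate)."""
    curve = []
    for k in range(steps + 1):
        threshold = 1.0 - k / steps
        observer_hits = sum(v > threshold for v in observer_values) / len(observer_values)
        random_hits = sum(v > threshold for v in random_values) / len(random_values)
        curve.append((random_hits, observer_hits))
    curve.append((1.0, 1.0))  # close the curve: values equal to 0 never exceed the threshold
    return curve


def area_under_curve(curve):
    """Trapezoidal area under (x, y) points ordered by non-decreasing x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Toy normalized map values in [0, 1]; these numbers are made up for illustration.
observer = [0.95, 0.9, 0.8, 0.3]
random_draws = [0.1, 0.4, 0.5, 0.85]
auc = area_under_curve(ordinal_dominance_curve(observer, random_draws))
print(round(auc, 3))
```

When the two value lists are identical, the curve follows the diagonal and the area is 0.5, which corresponds to the chance level described above.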

As to the attention-based bit allocation in video compression, the
latest video compression standard, H.264/AVC, and its reference
software JM9.8 were adopted for the experiment. In H.264,
a total of 52 different values of Qstep are supported and they are
indexed by a Quantization Parameter (QP). Qstep increases by 12.5%
for each increment of 1 in QP. In this paper, QPs are adjusted at the
MB level, which means a different QP is computed for each MB.
There are two reasons for this: first, in H.264 frames are encoded at
the MB level; second, the generated guidance map has the same size
as the frame size in MBs. QPs are computed according to Eq. (11),
where the $w_i$ are replaced by the corresponding GM values and $Q_{step}$ is
taken from the baseline QP value. Furthermore, in order to keep
perceptual quality smooth, the largest Qstep is constrained to be
at most 2 times the smallest Qstep, which means that the
difference between QPs in one frame is constrained to at most 6. In the
implementation, the smallest QP is set to $QP_{baseline} - 2$ while the
largest QP is set to $QP_{baseline} + 3$.
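
As a rough illustration of how guidance-map values could be turned into per-macroblock QPs, the sketch below applies the inverse proportionality between quantization step and weight, converts the step ratio into a QP offset using the 12.5% growth per QP unit, and clamps the result to the range from QP_baseline − 2 to QP_baseline + 3. It is a sketch under stated assumptions, not the JM9.8 modification used in the experiments: equal-area macroblocks, rounding to the nearest integer offset, the small floor on the weights, and the name `macroblock_qps` are all assumptions.

```python
import math

QSTEP_GROWTH = 1.125        # Qstep grows by 12.5% per unit of QP, as stated in the text
QP_RANGE = (0, 51)          # the 52 Qstep values of H.264 are indexed by QP 0..51


def macroblock_qps(gm_values, qp_baseline, max_decrease=2, max_increase=3, floor=1e-6):
    """Map guidance-map weights (one per macroblock) to per-macroblock QPs."""
    weights = [max(w, floor) for w in gm_values]   # assumed guard against zero weights
    mean_weight = sum(weights) / len(weights)      # equals W / N for equal-area macroblocks
    qps = []
    for w in weights:
        qstep_ratio = mean_weight / w              # step for this macroblock relative to baseline
        delta = round(math.log(qstep_ratio) / math.log(QSTEP_GROWTH))
        delta = max(-max_decrease, min(max_increase, delta))
        qps.append(max(QP_RANGE[0], min(QP_RANGE[1], qp_baseline + delta)))
    return qps


print(macroblock_qps([4.0, 1.0, 1.0, 0.25], qp_baseline=28))
```

With equal weights every macroblock keeps the baseline QP; macroblocks with above-average weight receive a lower QP (finer quantization) and the others a higher one, within the clamped range.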

Fig. 6. Examples of video frames (top) and the corresponding guidance maps (bottom). Subjects' eye fixations are marked as small square color patches in the frames; the white
patches in (b) indicate a saccade. (a) Frame from park03, (b) frame from room02.

To measure the subjective quality of encoded frames, eye-tracking
data are used to compute the weighted distortion. Here we propose
to use new eye-tracking weighted mean square error (EWMSE) and
eye-tracking weighted peak signal-to-noise ratio (EWPSNR) metrics
to measure subjective quality. The corresponding
formulas are as follows:

$EWMSE = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} \left( w_{x,y} \cdot (I'_{x,y} - I_{x,y})^2 \right)$ (12)

$EWPSNR = 10 \cdot \log_{10} \left( \frac{(2^n - 1)^2}{EWMSE} \right)$ (13)

$w_{x,y} = \frac{1}{2 \pi \sigma_x \sigma_y} \exp \left( - \left( \frac{(x - x_e)^2}{2 \sigma_x^2} + \frac{(y - y_e)^2}{2 \sigma_y^2} \right) \right)$ (14)

where $I$ and $I'$ are the original frame and the encoded frame,
respectively, M and N are the frame's height and width in pixels, and n
is the bit depth of the color component. $w_{x,y}$ is the weight for
distortion at position (x, y), normalized so that $\sum_x \sum_y w_{x,y} = MN$. $w_{x,y}$ is
computed based on the subjects' eye fixation position $(x_e, y_e)$ from the
eye-tracking experiment; $\sigma_x$ and $\sigma_y$ are two parameters related to the
viewing distance and view angle, usually taken from the fovea size. Here we use 2°
(64 pixels) of visual field for $\sigma_x$ and $\sigma_y$. The rationale for this weighting
formula is that the photoreceptors in the human retina are distributed
highly non-uniformly: only a small region of 2-5° of visual angle
(the fovea) around the center of gaze is captured at high resolution,
and the resolution falls off quickly around the fovea [3]. In our
experiment, the eye fixations are recorded with a 240 Hz eye-tracker;
considering that the video frame rate is 30 Hz, for each subject, 8 eye
fixation points need to be taken into account for each frame.
Therefore, the weight $w_{x,y}$ in practice is a combination of all 8
eye fixation points. Furthermore, the saccade data are not considered
in computing the EWPSNR; only the fixation points are taken into
account. We did this because humans make saccades very quickly and do
not pay much attention to the regions crossed during saccades. The mean EWPSNR
over all subjects is adopted as the measure of
subjective video quality: the higher the EWPSNR value, the better the
subjective quality.

Table 1
Comparison of EWPSNR results between JM9.8 and the proposed model (VAGBA: visual attention guided bit allocation) for different clips. All values are in dB. QP is the
baseline quantization parameter. Gain is the improvement in EWPSNR over the standard JM9.8 method. Gain > 0 means that, for the current clip, the
subjective quality of the video encoded by the proposed method is better than that of the video encoded by JM9.8.

          QP = 24                QP = 28                QP = 32                QP = 36
Clip      JM     VAGBA   Gain    JM     VAGBA   Gain    JM     VAGBA   Gain    JM     VAGBA   Gain
1         41.15  42.89   1.75    38.66  40.16   1.50    35.67  37.49   1.82    32.77  34.74   1.97
2         41.36  42.02   0.66    38.48  39.27   0.79    35.41  36.59   1.18    32.93  33.93   1.00
3         43.18  42.99  -0.19    40.56  40.56   0.00    38.08  38.17   0.09    35.77  35.85   0.08
4         42.67  42.83   0.16    40.47  40.52   0.05    38.03  38.29   0.27    34.92  36.08   1.16
5         41.77  43.08   1.31    38.93  40.34   1.41    36.03  37.58   1.56    33.21  34.84   1.64
6         41.95  42.47   0.52    39.25  39.74   0.49    36.54  37.06   0.51    33.78  34.46   0.69
7         42.76  42.69  -0.07    40.06  40.08   0.02    37.14  37.49   0.35    34.32  34.96   0.65
8         43.35  43.74   0.39    40.71  41.20   0.49    38.35  38.69   0.34    35.67  36.18   0.51
9         42.07  42.58   0.50    39.29  39.91   0.61    36.56  37.30   0.74    33.97  34.59   0.62
10        41.51  42.00   0.49    38.79  39.24   0.45    36.06  36.52   0.46    33.38  33.84   0.45
11        41.89  42.50   0.61    39.34  39.82   0.48    36.74  37.23   0.49    34.04  34.59   0.56
12        41.30  41.66   0.36    38.57  38.88   0.31    35.73  36.18   0.45    33.00  33.51   0.51
13        42.15  42.76   0.61    38.98  40.12   1.14    36.83  37.50   0.67    33.68  34.84   1.16
14        40.22  41.14   0.92    37.38  38.23   0.85    34.57  35.39   0.82    31.92  32.66   0.73
15        41.84  42.57   0.73    39.14  40.01   0.87    36.54  37.42   0.88    33.92  34.83   0.91
16        41.22  42.39   1.17    38.15  39.56   1.41    35.48  36.83   1.36    32.70  34.11   1.41
17        43.07  43.45   0.38    40.44  40.93   0.49    37.64  38.45   0.82    35.24  35.95   0.71
18        41.29  42.13   0.84    38.32  39.48   1.16    35.64  36.91   1.27    33.15  34.40   1.25
19        41.68  42.67   0.98    38.72  39.98   1.26    36.27  37.32   1.05    33.49  34.66   1.17
20        42.73  42.64  -0.09    40.20  39.99  -0.20    37.55  37.37  -0.18    34.98  34.75  -0.24
21        42.87  43.39   0.52    40.25  40.81   0.56    37.51  38.23   0.72    34.91  35.66   0.75
22        42.99  42.50  -0.48    40.48  39.92  -0.55    37.90  37.34  -0.56    35.28  34.81  -0.46
23        44.76  45.70   0.94    42.57  43.49   0.92    40.09  41.09   1.01    37.65  38.55   0.90
24        41.69  42.48   0.79    39.09  39.83   0.73    36.44  37.17   0.72    33.77  34.51   0.73
25        43.12  43.55   0.43    40.61  41.02   0.41    38.10  38.47   0.37    35.49  35.85   0.36
26        42.68  42.84   0.16    39.99  40.12   0.13    37.18  37.36   0.18    34.32  34.62   0.31
27        44.82  45.50   0.67    42.75  43.47   0.72    40.43  41.20   0.77    37.94  38.73   0.79
28        42.45  43.25   0.81    39.55  40.76  1.204    37.17  38.29   1.12    35.05  35.79   0.75
29        42.05  43.10   1.05    39.05  40.38   1.33    36.16  37.68   1.52    33.41  34.98   1.57
30        42.18  42.34   0.16    39.52  39.58   0.06    36.93  36.96   0.03    34.18  34.38   0.21
31        43.66  45.95   2.29    41.72  43.94   2.22    38.84  41.61   2.77    36.34  39.12   2.77
32        41.15  42.55   1.40    38.38  39.77   1.39    35.28  37.02   1.74    32.25  34.28   2.03
33        42.22  43.04   0.83    39.48  40.39   0.91    36.40  37.79   1.39    33.86  35.23   1.37
34        42.69  43.36   0.67    40.21  40.79   0.58    37.72  38.27   0.55    34.95  35.84   0.88
35        44.65  45.38   0.74    43.08  43.61   0.54    41.01  41.48   0.47    38.75  39.21   0.46
36        42.15  42.68   0.53    39.17  40.01   0.84    36.35  37.38   1.03    33.33  34.79   1.46
37        42.49  42.84   0.35    39.73  40.13   0.40    36.97  37.50   0.53    34.26  34.90   0.64
38        42.56  43.37   0.81    40.08  40.82   0.74    37.54  38.27   0.73    34.85  35.71   0.87
39        42.00  42.04   0.03    39.49  39.36  -0.13    36.92  36.77  -0.15    34.42  34.22  -0.20
40        41.77  42.79   1.02    38.74  40.08   1.35    35.74  37.46   1.72    32.95  34.84   1.89
41        42.73  43.17   0.44    39.79  40.48   0.69    37.00  37.88   0.88    34.43  35.30   0.87
42        41.86  42.62   0.76    39.29  39.98   0.69    36.74  37.40   0.66    34.23  34.91   0.68
43        43.61  44.05   0.44    40.82  41.54   0.72    38.22  39.02   0.79    35.64  36.56   0.93
44        42.79  43.67   0.88    40.46  41.12   0.65    38.59  38.53  -0.06    35.80  36.03   0.23
45        42.79  42.84   0.05    40.22  40.24   0.02    37.41  37.63  0.223    34.51  35.08   0.56
46        41.63  42.74   1.10    38.97  39.97   1.00    36.26  37.26   1.01    33.37  34.59   1.22
47        42.14  43.64   1.50    39.23  41.05   1.81    36.74  38.47   1.73    34.17  35.95   1.79
48        42.51  42.74   0.23    39.89  40.14   0.25    37.15  37.50   0.36    34.27  34.90   0.62
49        41.54  43.07   1.53    38.65  40.31   1.66    35.58  37.58   2.01    32.57  34.83   2.26
50        42.48  43.81   1.33    40.16  41.31   1.15    37.02  38.77   1.75    34.42  36.22   1.79
Average   42.36  43.04   0.68    39.71  40.44   0.73    37.03  37.85   0.82    34.35  35.27   0.92
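
The EWMSE and EWPSNR metrics can be sketched in plain Python as follows. This is a minimal illustration, not the evaluation code used for Table 1. It assumes that the weight map is a sum of one 2-D Gaussian per fixation (the text only says the weight is "a combination" of the fixations), that the weights are rescaled to sum to the number of pixels, and that EWMSE is the resulting weight-averaged squared error. Here x indexes columns and y indexes rows, and the function names are hypothetical.

```python
import math


def gaussian_weights(height, width, fixations, sigma_x, sigma_y):
    """Sum one 2-D Gaussian per fixation (x_e, y_e); rescale so the weights sum to height * width."""
    norm = 1.0 / (2.0 * math.pi * sigma_x * sigma_y)
    weights = [[0.0] * width for _ in range(height)]
    for x_e, y_e in fixations:
        for y in range(height):
            for x in range(width):
                exponent = (x - x_e) ** 2 / (2 * sigma_x ** 2) + (y - y_e) ** 2 / (2 * sigma_y ** 2)
                weights[y][x] += norm * math.exp(-exponent)
    scale = height * width / sum(map(sum, weights))
    return [[w * scale for w in row] for row in weights]


def ewmse(original, encoded, weights):
    """Eye-tracking weighted mean square error of one frame (one color component)."""
    height, width = len(original), len(original[0])
    total = sum(
        weights[y][x] * (encoded[y][x] - original[y][x]) ** 2
        for y in range(height)
        for x in range(width)
    )
    return total / (height * width)


def ewpsnr(original, encoded, weights, bit_depth=8):
    """Eye-tracking weighted PSNR in dB; infinite when the frames are identical."""
    error = ewmse(original, encoded, weights)
    if error == 0:
        return math.inf
    return 10.0 * math.log10((2 ** bit_depth - 1) ** 2 / error)


size = 16
original = [[128] * size for _ in range(size)]
weights = gaussian_weights(size, size, fixations=[(3, 3)], sigma_x=2.0, sigma_y=2.0)

near = [row[:] for row in original]
near[3][3] += 10          # error placed at the fixated pixel
far = [row[:] for row in original]
far[12][12] += 10         # the same error placed far from the fixation

print(ewpsnr(original, near, weights) < ewpsnr(original, far, weights))  # True
```

The same pixel error lowers EWPSNR more when it falls at the fixated location than when it falls in the periphery, which is the property the metric is designed to capture.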

To show the effectiveness of the proposed visual attention guided
bit allocation method in improving subjective video quality, we
compare the EWPSNR of video encoded with the proposed method against
that of the standard method in JM9.8, with the bit rate matched through
the frame-level rate control algorithm. The encoder is configured as
follows: intra period = 30, Hadamard transform, UVLC, no fast
motion estimation, no B frames, high-complexity RDO mode, no
restriction on search range. We test 4 baseline QPs (initial QP): 24, 28,
32 and 36; the bit rates range from 260 Kbps to 10 Mbps. The
rate-controlled bit rates of the standard encoder match the bit
rates of the proposed encoder to within 1%. Table 1
lists the EWPSNR results of the proposed method, the results of
the JM9.8 standard method, and the subjective quality improvement
(gain). Better EWMSE and EWPSNR are expected for our
method vs. JM only if, on average, the predictions of the saliency
model agree with where humans look and, when they disagree, the
higher distortions that our system introduces at the locations
looked at by humans do not outweigh the lower distortions obtained
when the model is correct. In addition, Fig. 7 plots the results at
different baseline QPs, sorted by improvement.
Only for a few (3-4) clips is the subjective quality
worse with our proposed method than with the standard method,
while most of the clips achieve a better subjective quality, with

Fig. 8. Comparison of histograms of PSNR results at eye fixation regions with the standard
JM9.8 method and the proposed VAGBA method (initial QP = 28; axis label: PSNR at fixation location). The fixation regions
span 2° of visual field.


improvement for some of them up to about 2 dB EWPSNR. Thus, the
proposed scheme achieves significantly (p < 0.002, t-test) better
subjective quality (as defined by our EWPSNR measure) while
keeping the same bit rate. Furthermore, histograms
of PSNR results at eye fixation regions (2° of visual field) for the two
methods are compared in Fig. 8. The figure shows that more
encoded frames have higher PSNR in the fixation region with the
proposed method than with the standard JM9.8 method. Fig. 9
shows two examples of EWPSNR over the clip frames. We can see


Fig. 7. EWPSNR results comparison between our proposed VAGBA model and JM9.8 at the same bit rate. The horizontal axis represents the clip index, after sorting by the subjective
quality improvement (the red bars shown in each plot).

Fig. 9. Two examples of EWPSNR over the clip frames: (a) garden09 from subject KC, initial QP = 28; (b) park03 from subject SH, initial QP = 28. Dashed grey lines indicate intra-coded
frames (every 30 frames with our codec settings).


from the figure that for most frames, the subjective quality of the
proposed method is better than that of the standard JM9.8 encoded result.
However, for some frames, the standard method achieves better
subjective quality. There are two main reasons for this. First, if the
current frame is an intra refresh frame, the rate control algorithm in
H.264 usually assigns a relatively smaller QP to such an intra frame to
achieve better prediction results for later P frames. In these cases, the
current frame's quality could be better than with the proposed
method. Second, the attention prediction model is not guaranteed to
accurately predict the regions humans attend to in every frame.
If the prediction fails for the current frame, then, due to the
bit allocation strategy, the true attention region receives fewer bits,
which makes the subjective quality worse.

Furthermore, we compare the proposed method with
our previous method [21], which guides video compression
through selective blurring (foveation) of low-salience image regions.
The foveated clips are encoded by JM9.8 with matched
bit rate through the frame-level rate control algorithm (within 1%
difference). Fig. 10 shows the comparison results at different baseline
QPs, sorted by improvement in EWPSNR. From


the figure we can see that for all clips the subjective quality is
significantly better with the proposed method than with the foveation
method (p<10“^°, t-test). The average improvement in EWPSNR is
2.533 dB. The figure also shows that the improvement is
higher when the baseline QP is lower. This is because foveation
degrades the video quality more than the encoding error does when the
quantization step is small. As another example, Fig. 11 compares the
visual qualities of three partially reconstructed frames among the
standard rate control method, the foveation method and the proposed
bit allocation method. The difference frames (encoding error) are also
plotted to make the comparison clearer. As shown in the figure, the
frame encoded by the proposed method has better visual quality in the
region of interest than the frame encoded by the standard method.

6.  Discussion 

The proposed attention model predicts human attention accurately
in most cases, and based on this, the bit allocation algorithm can
improve the subjective visual quality significantly while keeping the
same bit rate. The contributions of this paper include several aspects:

(1) By combining the attention prediction model with a state-of-the-art
video compression method, the proposed method is fully automatic
and can be applied to any category of video without
restriction. Although it is combined here with the H.264 codec, the
proposed method can be applied to any kind of codec proposed so far.

(2) A new bit allocation strategy was proposed by solving in
closed form the constrained global optimization problem.

(3) Eye-tracking data are used to evaluate compressed video frame quality in a
quantitative way (instead of subjective rating or binary selection). The
newly proposed EWPSNR subjective quality measure is based on
the characteristics of human vision and can represent the non-uniform
subjective distortion in a reasonable way.

(4) A high-definition video sequence database with eye-tracking data
was collected and distributed on the Internet. This dataset can be used for video
compression purposes as well as attention prediction purposes. Also,
the raw captured frames are in Bayer format, which means this
dataset can be used for Bayer format related image processing
research.

Fig. 10. EWPSNR results comparison between our proposed VAGBA model and the foveation method (marked as FOV in the figure) at the same bit rate. The horizontal axis represents the
clip index, after sorting by the subjective quality improvement (the green bars shown in each plot).

The goal of this paper is to validate the effectiveness of the method in
improving the subjective quality while keeping the same bit rate
and employing a purely algorithmic method which does not require
manual parameter tuning. Thus far, the bit allocation strategy is an
open-loop algorithm, which means that it only adjusts the bit
allocation according to the guidance map and imposes no constraint to
maintain any preset bit rate. In our implementation, we first compress
the video sequences with the proposed method; after that, we use the
rate control algorithm available in JM9.8 to match the bit rate and
compare this result with that of our proposed method. The comparison is
reasonable because there is no scene change in the test sequences and
thus neither bit rate nor visual quality should fluctuate much
over video frames. Although the proposed method is not bit rate
constrained, it can be used in many applications such as video storage,
band-free video stream transmission, etc. However, developing a
bit rate constrained bit allocation strategy based on
the visual attention model remains a significant task. Another point which is not addressed by
an open-loop algorithm but could be studied more closely in the
future is how the algorithm itself may introduce artifacts in the
low-bit-rate regions of the compressed video, which may themselves be
salient and attract human attention (see [21] for more detailed
discussion). This could be addressed in future versions of our
algorithm where the saliency map may be computed on the
compressed video clips as well, to check for the introduction of
possibly salient artifacts during compression.

Eye-tracking data recorded from subjects viewing the uncompressed
video clips are used in evaluating the subjective quality.
The rationale for viewing uncompressed video lies in two aspects.
First, the eye-tracking traces from the uncompressed videos show
the real attention regions of the original clips. Second, the ideal
subjective quality measurement would use eye-tracking data
from both the video encoded with the proposed method and the standard
rate-controlled video. However, it is impossible to obtain these two kinds
of eye-tracking data from the same subject without affecting a priori
knowledge: no matter which kind of encoded video is presented to
subjects first, the eye-tracking data from the second presentation
would likely be affected by the fact that subjects have already seen
essentially the same clips before. Considering that artifacts might
attract attention, two steps are adopted to reduce the quality
fluctuation. First, both temporal and spatial smoothing are
applied in computing the guidance map. Second, the largest Qstep
is constrained to be at most 2 times the smallest Qstep in
one frame to preserve the perceptual quality; in the implementation, the
smallest QP is set to QP_baseline - 2 while the largest QP is set to
QP_baseline + 3. In our experiment, we checked the compressed videos
and found no large quality fluctuation in either the spatial or the
temporal domain.
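
The QP limits above can be related to the factor-of-2 bound on Qstep with a short calculation. The sketch below assumes the standard H.264 relation in which the quantization step size doubles for every increase of 6 in QP; the function names `qstep_ratio` and `clamp_qp` are hypothetical helpers introduced only for this illustration.

```python
def qstep_ratio(qp_high, qp_low):
    """Ratio of quantization step sizes for two QP values.

    Assumes Qstep doubles for every increase of 6 in QP (H.264).
    """
    return 2.0 ** ((qp_high - qp_low) / 6.0)


def clamp_qp(qp, qp_baseline, low_offset=-2, high_offset=3):
    """Restrict a per-macroblock QP to [baseline - 2, baseline + 3]."""
    return max(qp_baseline + low_offset, min(qp_baseline + high_offset, qp))
```

Under this assumption, the span from QP_baseline - 2 to QP_baseline + 3 is 5 QP units, giving a step-size ratio of 2^(5/6), roughly 1.78, which stays below the stated factor of 2.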

EWPSNR is proposed for computing the subjective quality. Here, we
wanted to investigate whether we could test our algorithm in a more
objective and more informative way than using subjective quality
ratings, where observers may not always be able to rationalize or
explain their ratings. Many existing computational subjective quality
measures [26,41,42] tend to rely on some knowledge about
the human visual system to decide what may be more visible or
important to a human observer. For example, see the measures of
FPSNR (Foveated PSNR, [26]), SSIM (structural similarity, [41]), and
DVQ (Digital Video Quality, [42]). Thus we have been concerned that
using these measures may be somewhat circular: (1) process video
images with some saliency algorithm and allocate more bits (lower
distortion) to more salient regions; (2) measure subjective quality
with an algorithm that is very similar to our saliency computation and
hence may be quite strongly correlated with it. We would then quite
naturally expect good subjective quality. This is what prompted us to
develop the EWPSNR metric. We believe that it is an objective way to
measure subjective quality, and it has the advantage of not relying on
any algorithmic assumptions regarding how subjective quality may be
defined. Here we just assume that the locations which people look at
are the ones which will matter in terms of subjective quality. Note that
this assumption is itself an imperfect one (subjective quality seems to
be influenced not only by foveal vision but also by peripheral vision).
But we believe that it at least avoids circularity in our testing, and it
also provides a very informative assessment of where the algorithm is
working (good agreement between the algorithm and human gaze) or
failing.

Over the 50 tested video clips, there are 3-4 cases in which the
subjective quality of clips encoded by our proposed method is
worse than that of the clips encoded by the standard H.264 method. The
worst two clips are gate03 and seagull01; example frames from
these two clips can be seen in Figs. 2 and 4. The main reason for the
failure is that, for these clips, the attention prediction model results
do not match the subjects' attention well. In the proposed attention
prediction model, high-motion regions take higher saliency values;
however, in the gate03 clip, the high-speed cars were less
interesting to our human observers than the jogging girl and the
flags. In the seagull01 clip, the seagulls fly everywhere and the video is
less meaningful in content; the subjects' eye-tracking traces are
highly divergent, and thus the proposed attention prediction model
cannot predict the attention of all the subjects accurately.

The attention prediction model proposed in this paper depends purely
on bottom-up low-level features. These features are
independent of the video contents and can be applied under any
conditions. However, in many specific cases, top-down influences can
be taken into consideration to improve the attention prediction
performance. For example, in teleconference videos or face-oriented
conditions, face information (using, e.g., face detection algorithms)
can be added as an important factor in predicting attention [11].
In a specific search task (person detection), models that combine saliency, target
features, and scene context can predict 94% of
human agreement [43]. Also, considering the layout information, the
“gist” of scenes can be applied to improve the prediction by learning
broad scene categories [44,45]. All these top-down factors may be
combined with the bottom-up model to improve the attention
prediction performance and thus improve the subjective quality of
compression in the corresponding specific conditions.

Saliency and Gist Features for Target
Detection in Satellite Images

Abstract: Reliably detecting objects in broad-area overhead
or satellite images has become an increasingly pressing need,
as the capabilities for image acquisition are growing rapidly.
The problem is particularly difficult in the presence of large
intraclass variability, e.g., finding “boats” or “buildings,” where
model-based approaches tend to fail because no good model or
template can be defined for the highly variable targets. This
paper explores an automatic approach to detect and classify
targets in high-resolution broad-area satellite images, which relies
on detecting statistical signatures of targets, in terms of a set of
biologically-inspired low-level visual features. Broad-area images
are cut into small image chips, analyzed in two complementary
ways: “attention/saliency” analysis exploits local features
and their interactions across space, while “gist” analysis focuses
on global nonspatial features and their statistics. Both feature sets
are used to classify each chip as containing target(s) or not, using
a support vector machine. Four experiments were performed to
find “boats” (Experiments 1 and 2), “buildings” (Experiment 3)
and “airplanes” (Experiment 4). In experiment 1, 14 416 image
chips were randomly divided into training (300 boat, 300 nonboat)
and test sets (13 816), and classification was performed on
the test set (ROC area: 0.977 ± 0.003). In experiment 2,
classification was performed on another test set of 11 385 chips
from another broad-area image, keeping the same training set
as in experiment 1 (ROC area: 0.952 ± 0.006). In experiment 3,
600 training chips (300 for each type) were randomly selected
from 108 885 chips, and classification was conducted (ROC area:
0.922 ± 0.005). In experiment 4, 20 training chips (10 for each
type) were randomly selected to classify the remaining 2581 chips
(ROC area: 0.976 ± 0.003). The proposed algorithm outperformed
the state-of-the-art SIFT, HMAX, and hidden-scale salient structure
methods, and previous gist-only features in all four experiments.
This study shows that the proposed target search method
can reliably and effectively detect highly variable target objects
in large image datasets.

Index Terms: Gist features, saliency features, satellite images,
target detection.

Manuscript received August 05, 2010; revised November 19, 2010; accepted
November 24, 2010. Date of publication December 13, 2010; date of current
version June 17, 2011. This work was supported by Defense Advanced Research
Projects Agency under Government Contract HR0011-10-C-0034, the
National Geospatial Intelligence Agency under Grant 19-1082141, the National
Science Foundation under CRCNS Grant BCS-0827764, the Army Research
Office under Grant W911NF-08-1-0360, and the China Scholarship Council
under Grant 2007103281. The associate editor coordinating the review of this
manuscript and approving it for publication was Dr. Kenneth K. M. Lam.

Z. Li was with the School of Automation Science and Electrical Engineering,
Beihang University, Beijing, 100191 China. He is now with the Computer Science
Department, University of Southern California, Los Angeles, CA 90089
USA (e-mail: lzcbuaa@gmail.com).

L. Itti is with the Computer Science Department, University of Southern California,
Los Angeles, CA 90089 USA (e-mail: itti@usc.edu).

Color  versions  of  one  or  more  of  the  figures  in  this  paper  are  available  online 
at  http://ieeexplore.ieee.org. 

Digital  Object  Identifier  10.1109/TIP.2010.2099128 


I.  Introduction 

OVERHEAD and satellite imagery have become ubiquitous,
with applications ranging from intelligence
gathering to consumer mapping and navigation assistance.
With the overwhelming amount of satellite imagery available
today, it has become impossible for human image analysts
to examine all of the imagery in search of interesting
intelligence information. Thus, there is a pressing need for
automatic algorithms to preprocess the data and to extract
actionable intelligence from raw imagery, thereby facilitating
and supporting human interpretation. This paper focuses on
automatically detecting diverse types of targets with large
intraclass variability in satellite images. This analysis is currently one of
the highly time-consuming tasks that image analysts
routinely perform manually. Providing new means to automate
this task is expected to facilitate and render more efficient the
interpretation of satellite images by human analysts.

The problem of target detection is a difficult challenge in
computer vision [1]-[3]. For a given scene (image), the target
detection task can be simply described as “where is the target?”
Considering the feature types used for detection in static images,
algorithms for target detection can be briefly summarized as
belonging to three broad categories. A first, relatively
straightforward approach is to use a provided (or trained) target
template or model (hence, the feature is the image itself), to match
against targets in the image of interest, at different locations,
orientations and scales [4]-[6]. This type of method works well
when the variability of targets is small (for example, detecting
human faces [5], [6]). A second method for target detection is to
use a model to extract a spatially sparse collection of invariant
structural features (e.g., keypoint descriptors, bags of features)
of the target even when viewpoint, pose, and lighting
conditions vary [7]-[10]. In a third approach, using knowledge of
target shape and characteristic geometry, several studies have
proposed methods which learn and apply target geometric
constraints on the keypoint feature locations [11], [12]. In practice,
the detection algorithms usually overlap these categories, and
some approaches are intermediate between the geometry-based
and “bag of features” approaches, retaining only some coarsely-coded
location information or recording the locations of features
relative to the target’s center [3], [13]. In addition to these
machine vision approaches, several biologically-inspired
computational models have also started exploring target detection tasks
in imagery, usually based on our knowledge of visual cortex,
showing some promising experimental results [14]-[19]. Our
approach extends these biologically-inspired frameworks.

Based on the special properties of satellite images, several
algorithms have been proposed to detect targets in such
images. For example, for hyperspectral satellite images, the
features applied usually take advantage of the reflection
characteristics of different materials [20]-[22], while for multispectral
images, the features are usually extracted from fused spectra
[23], [24]. However, the images discussed in this paper are in
the visible spectrum and, thus, the detection methods discussed
in the previous paragraph are usually adopted. Despite all the
recent advances in computer vision technologies, humans still
perform orders of magnitude better than the best available vision
systems in object and target detection, and for many target search
applications humans remain the gold standard. As such, it is
reasonable to examine the low-level mechanisms as well as the
system-level computational architecture of human vision for
inspiration. Early on, the human visual processing system already
makes decisions to focus attention and processing resources onto
those small regions within the field of view which look more
interesting or visually “salient” [25]-[27]. When no specific
search target, no search task, and no particular time or other
constraint are specified to an observer, bottom-up (image-derived)
information may play a predominant role in guiding attention
toward potential generically interesting targets [28]. The
mechanism of selecting a small set of candidate salient locations in a
scene has recently been the subject of comprehensive research
efforts and several computational models have been proposed
[29]-[34]. One can make use of these models to predict possible
target locations and target distributions. In this paper, saliency
maps from several feature channels (intensity contrast, local
edge orientation, etc.) are computed from a modified Itti-Koch
saliency model [25], [31], [35]. Given a static or dynamic visual
scene, this model creates a number of multiscale topographic
feature maps which analyze the visual inputs along visual feature
channels known to be represented in the primate brain [31]
and thought to guide visual attention and search [36]
(luminance contrast, color-opponent contrast, oriented edges, etc.).
Center-surround mechanisms and long-range competition for
salience operate separately within each feature channel, coarsely
reproducing neuronal interactions within and beyond the
classical receptive field of early sensory neurons [37], [38]. These
interactions are critical in transforming raw feature responses
(e.g., an edge map computed over the input scene) into salient
feature responses, as they emphasize locations which are local
outliers relative to the global statistics of the scene. As a result, local
feature responses (e.g., a color contrast response to a small red
object in an image) are modulated globally depending on the
entire scene’s content (e.g., the response to the small red object
might be inhibited if many other red objects are present in the
scene, or might be amplified if all other objects in the scene are
blue). After these interactions, the feature maps from all feature
channels are combined into a single scalar topographic saliency
map. Locations of high activity in the saliency map are more
likely to attract attention and gaze [28], [29].
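
The global modulation described above can be illustrated with a small sketch. It uses the map normalization operator from the original Itti-Koch-Niebur formulation (rescale the map to a fixed range, then multiply it by the squared difference between the global maximum and the mean of the other local maxima) as a stand-in for the long-range competition; the modified model used in this paper may implement the competition differently. The function name `normalize_map` and the use of strict 8-neighbour local maxima are assumptions made for illustration.

```python
def normalize_map(fmap, top=1.0):
    """Promote maps with one strong peak, suppress maps with many.

    fmap is a 2D list of non-negative feature responses.
    """
    lo = min(min(row) for row in fmap)
    hi = max(max(row) for row in fmap)
    if hi == lo:
        return [[0.0 for _ in row] for row in fmap]
    # Step 1: rescale to the fixed range [0, top].
    scaled = [[top * ((v - lo) / (hi - lo)) for v in row] for row in fmap]
    h, w = len(scaled), len(scaled[0])
    # Step 2: collect strict local maxima other than one global maximum.
    others = []
    skipped_global = False
    for y in range(h):
        for x in range(w):
            v = scaled[y][x]
            neigh = [scaled[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))
                     if (i, j) != (x, y)]
            if all(v > n for n in neigh):
                if v == top and not skipped_global:
                    skipped_global = True
                else:
                    others.append(v)
    mean_other = sum(others) / len(others) if others else 0.0
    # Step 3: scale the whole map by (top - mean_other)^2.
    gain = (top - mean_other) ** 2
    return [[v * gain for v in row] for row in scaled]
```

In this sketch a map containing a single isolated peak keeps its full strength, whereas a map containing several equally strong peaks is scaled toward zero, mirroring the red-object example in the text.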

Thus far, saliency-based analysis of scenes has been
predominantly applied to relatively small images, typically on the
order of 1 megapixel (MP), with at least one study pushing to
24 MP [40]. Such smaller images coarsely match the
amount of information which might arise from a primate retina
(about 1 million distinct nerve fibers in each of the human optic
nerves). With larger broad-area-search images, for example
400 MP-1000 MP satellite images, it becomes an interesting
research question whether the mechanisms developed by the
primate brain might scale up. Here, we address this question
by developing a new algorithm, which analyzes large images
in small chips, thus mimicking the processing which human
image analysts might perform when they deploy multiple eye
fixations on an image, analyzing each fixated location in turn.
A second important research question is whether saliency maps
might be useful at all for object classification, as opposed to
being limited to just attention guidance as described previously.
Here we hypothesize that, within each chip, the chip’s saliency
map may provide a coarse indication of the structure of the
visual contents of the chip. Hence, rather than attempting to
shift an attention spotlight to different salient locations within
the chip, the hypothesis underlying the proposed algorithm
is that a coarse analysis of the statistics of a chip’s saliency
map may provide sufficient clues for classifying the chip as
containing a target or not. For example, target chips might
have more numerous and sharper saliency peaks than nontarget
chips. Our experiments and results test whether this approach
is viable for complex target classification tasks where the
intraclass heterogeneity is significant (e.g., find “boats,” ranging
from small pleasure craft to larger commercial or military
ships). Each saliency map is summarized by its mean, variance,
number of local maxima, and average distance between the
locations of local maxima. These values
to some extent represent the saliency intensity and the salient
objects’ spatial distribution. In the full algorithm described
in the following, all of these values from different feature
channels’ saliency maps are combined together to form the
“saliency features” part of the proposed algorithm.
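
The four summary statistics named above can be sketched as follows. This is a minimal illustration, assuming that a local maximum is a pixel strictly greater than all of its 8 neighbours and that the average distance is the mean Euclidean distance over all pairs of local maxima; neither detail is specified in the text, and the function name `saliency_stats` is hypothetical.

```python
import math


def saliency_stats(smap):
    """Return (mean, variance, number of local maxima, mean peak distance).

    smap is a 2D list holding one saliency map.
    """
    h, w = len(smap), len(smap[0])
    values = [v for row in smap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # Strict 8-neighbour local maxima (an assumption of this sketch).
    peaks = []
    for y in range(h):
        for x in range(w):
            v = smap[y][x]
            neigh = [smap[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))
                     if (i, j) != (x, y)]
            if all(v > n for n in neigh):
                peaks.append((x, y))
    # Mean Euclidean distance over all pairs of peaks (0 if fewer than two).
    dists = [math.hypot(p[0] - q[0], p[1] - q[1])
             for i, p in enumerate(peaks) for q in peaks[i + 1:]]
    avg_dist = sum(dists) / len(dists) if dists else 0.0
    return mean, var, len(peaks), avg_dist
```

Concatenating these four values over the saliency maps of all feature channels would give one possible form of the “saliency features” vector for a chip.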

In parallel with attention guidance and mechanisms for saliency
computation, studies of scene perception have shown that
observers can recognize the “gist” of a real-world scene from a
single, possibly very brief glance. For example, following
presentation of a photograph for just a fraction of a second, a human
observer may report that it is an indoor meeting room or an
outdoor scene of a beach [41]-[45]. Such a report from the first
glance onto an image is remarkable considering that it
summarizes the quintessential characteristics of an image, a process
previously thought to require deep visual and cognitive analysis.
With very brief exposures (100 ms or below), reports are
typically limited to a few general semantic attributes (e.g., indoors,
outdoors, playground, mountain) and a coarse evaluation of the
distributions of visual features (e.g., grayscale, colorful, large
masses, many small objects) [46]-[48]. Gist may be computed
in brain areas which have been shown to preferentially respond
to “places,” that is, visual scene types with a restricted spatial
layout [49]. Like Siagian-Itti’s gist formulation in computer
vision [50], here we use the term “gist” to represent a
low-dimensional (compared with the raw image pixel array) scene
representation feature vector which is acquired over a very short time.
In our target detection scenario, this feature vector is computed
for every image chip, and we explore how well it may represent
the overall information of the chip so as to support classification
(e.g., chips containing boats might have significantly different
gist signatures than chips which do not). Saliency and gist
features appear to be complementary opposites [50]: saliency
features tend to capture and summarize the intensity and spatial
distribution of those objects within a chip which stand out by being
significantly different from their neighbors, while gist features
capture and summarize the overall statistics and contextual
information over the entire chip.

Fig. 1. Diagram of the image classification system applied to every image chip.

Given the proposed chip-based analysis approach, the task
of answering “where is the target?” is equivalent to answering
“does this image chip include the target?” for every chip in
a large image. To achieve this decision making task, a Support
Vector Machine (SVM) [51], [52] is adopted as the classifier,
while the biologically inspired saliency-gist features are
explored to form the feature vector in the feature space. The
system overview diagram can be seen in Fig. 1.
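
The reduction from “where is the target?” to a per-chip yes/no decision can be sketched as below. The chip size of 64 pixels, the non-overlapping tiling, the dropping of partial chips at the image border, and the `contains_target` callable (a placeholder for the trained classifier applied to a chip's feature vector) are all assumptions made for illustration and are not taken from the paper.

```python
def iter_chips(width, height, chip=64):
    """Yield (left, top, chip, chip) boxes tiling the image without overlap.

    Partial chips at the right and bottom borders are skipped.
    """
    for top in range(0, height - chip + 1, chip):
        for left in range(0, width - chip + 1, chip):
            yield (left, top, chip, chip)


def locate_targets(width, height, contains_target, chip=64):
    """Return the boxes of all chips that the classifier marks as target.

    contains_target is a hypothetical callable: box -> bool.
    """
    return [box for box in iter_chips(width, height, chip)
            if contains_target(box)]
```

The list of positive chip boxes then serves as the coarse answer to the localization question, at the resolution of one chip.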

II. Design and Implementation

Here  we  first  describe  the  two  computational  models  pro¬ 
posed  to  compute  the  saliency  features  and  gist  features 
separately. 

A.  Saliency  Feature  Computation 

We compute saliency maps using several variants of the
general Itti-Koch [31] architecture, and we then compute
basic saliency map statistics for each variant. While in the
original model only simple biological features (color, intensity,
orientation) were employed, we here develop several new
features which might be more effective in supporting the
target/non-target classification task. The block diagram of the
proposed model is shown in Fig. 2. In this model, an image is
analyzed along multiple low-level feature channels to give rise
to multiscale feature maps, which, as in the original Itti-Koch
model, detect potentially interesting local spatial outliers. Ten
feature channels are adopted in this paper: intensity, orientation
(0°, 45°, 90° and 135°, combined into one “orientation”
channel), local variance, entropy, spatial correlation, T-junctions,
L-junctions, X-junctions, endpoints and surprise. Note
that color information is not used since the images often are
greyscale. Some of these feature channels (variance, entropy,
spatial correlation) are computed by analyzing 16 × 16 image
patches, giving rise to a map that is 16 times smaller than
the original image horizontally and vertically (one map pixel
per 16 × 16 image patch). The remaining feature channels are
computed using image pyramids and center-surround differences,
as in the original Itti-Koch algorithm: for each of these
feature channels, center-surround scales are obtained from
dyadic pyramids with nine scales, from scale 0 (the original
image) to scale 8 (the image reduced by a factor of $2^8 = 256$
in both the horizontal and vertical dimensions). Six center-surround
difference maps are then computed as point-to-point
differences across pyramid scales, for each combination of three
center scales ($c \in \{2, 3, 4\}$) and two center-surround scale
differences ($\delta \in \{3, 4\}$). Each feature map is additionally
endowed with internal dynamics that provide a strong spatial
within-scale competition for activity, followed by within-feature,
across-scale competition. In this way, initially possibly
very noisy feature maps are reduced to sparse representations
of only those locations which strongly stand out from their
surroundings. Feature maps then contribute additively to the
corresponding saliency maps (SMs) that represent the conspicuity
of each location in their channel. Finally, a saliency
map feature extractor is applied to summarize each saliency
map into a 4D vector with mean, variance, number of local
maxima and average distance between locations of local
maxima. All those feature vectors from the ten model variants
are combined into a 40D vector referred to as the “saliency
features.” The model is described in more detail in the
following.

Fig. 2. Block diagram of the saliency features computation model applied to
every image chip. [Diagram blocks: saliency maps, saliency map feature
extractor, saliency feature vector.]
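The pairing of center scales and scale differences described above can be enumerated in a few lines. This is only an illustrative sketch: the function name `center_surround_pairs` is hypothetical, and it assumes the usual convention that the surround scale equals the center scale plus the scale difference.

```python
def center_surround_pairs(centers=(2, 3, 4), deltas=(3, 4)):
    """Return (center, surround) pyramid-scale pairs.

    Assumption: surround scale = center scale + scale difference.
    """
    return [(c, c + d) for c in centers for d in deltas]


pairs = center_surround_pairs()
print(pairs)       # the six (center, surround) scale pairs
print(len(pairs))  # 6
```

Under this convention the coarsest surround scale is 8, which is the last level of a nine-scale pyramid.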
Intensity Channel: With the image chip as input, nine spatial
scales are created using a dyadic Gaussian pyramid [25],
which progressively low-pass filters and subsamples the input
image, yielding horizontal and vertical image-resolution factors
ranging from 1:1 (scale 0) to 1:256 (scale 8).

Intensity represents the amount of light reflected by the corresponding
point on the object in the direction of the camera,
multiplied by a constant factor that depends on
the parameters of the imaging system. In our experiments,
intensity values range from 0 to 65 535 (16-bit images) or
from 0 to 255 (8-bit images) for all images $I_s$ ($s = 0, 1, \ldots, 8$)
at every spatial scale. This channel is essentially as previously
described [25].

Orientation Channel: Orientation features are generally very
effective in identifying objects, as demonstrated for example
by humans’ ability to understand line drawings. Here we
adopt Gabor filters ($\theta_k$ = 0°, 45°, 90°, 135°) to extract the
orientation features. For each image $I$ in the image pyramid, the
orientation feature maps can be obtained as follows [25]:

$$M_{o,k} = \mathrm{Gabor}(I, \theta_k). \qquad (1)$$

Local Variance Channel: The local variance channel is used
to capture local pixel intensity variance over 16 × 16 image
patches of the image chip of interest. This feature is of interest
here as it has previously been shown to attract human attention
[53], [54]. For each 16 × 16 image patch, the local variance
feature map can be computed as follows:

$$\mathrm{Var}(i,j) = \frac{1}{S_{sz} - 1} \sum_{(u,v) \in I_{sz}} \left( I(u,v) - \bar{I}_{sz} \right)^2 \qquad (2)$$

where $I_{sz}$ is the neighborhood of pixel $(i, j)$, $\bar{I}_{sz}$ is its mean
intensity, and $S_{sz}$ is the total number of pixels in that neighborhood
of size $sz$ ($sz = 16 \times 16$ in our implementation).
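As a rough illustration of a patch-wise variance map, the sketch below computes one value per non-overlapping 16 × 16 patch. It assumes the unbiased sample variance (denominator equal to the number of patch pixels minus one) and a plain list-of-lists image; the name `patch_variance_map` is hypothetical.

```python
def patch_variance_map(image, patch=16):
    """Sample variance (denominator n - 1) of each non-overlapping patch."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            pixels = [image[r * patch + i][c * patch + j]
                      for i in range(patch) for j in range(patch)]
            n = len(pixels)
            mean = sum(pixels) / n
            row.append(sum((p - mean) ** 2 for p in pixels) / (n - 1))
        out.append(row)
    return out


# 16 x 32 test image: left patch is flat, right patch is a 0/2 checkerboard.
image = [[5] * 16 + [2 * ((i + j) % 2) for j in range(16)] for i in range(16)]
variance_map = patch_variance_map(image)
print(variance_map)  # one row with two entries: flat patch, then textured patch
```

The output map has one entry per patch, which is why the resulting map is 16 times smaller than the input along each axis.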

Entropy Channel: Entropy as implemented here provides
a simple measure of information content in small 16 × 16
image patches. We follow the definition proposed by Privitera
and Stark [54], who showed that such a measure of entropy also
correlates with human eye fixations. Note that many more sophisticated
measures of entropy could be computed at the chip,
image, or image sequence level, but this one has the advantage
of being simple and motivated by previous human gaze
tracking experiments. In image processing, entropy characterizes
the probability distribution of the image intensity. The
entropy value is computed as

$$E(i,j) = -\sum_{I \in I_{sz}} p(I) \log(p(I)) \qquad (3)$$

where $I_{sz}$ denotes the neighborhood of the pixel at location
$(i, j)$, and $p(I)$ is the probability of intensity $I$ in that
neighborhood.
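A minimal sketch of the patch entropy, assuming that the probability of each intensity is estimated from its relative frequency within the patch and that the natural logarithm is used (the text does not state the base). The helper name `patch_entropy` is hypothetical.

```python
import math
from collections import Counter


def patch_entropy(pixels):
    """Shannon entropy -sum p(I) log p(I) of the intensities in one patch."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((k / n) * math.log(k / n) for k in counts.values())


flat = [7] * 256                   # a uniform 16 x 16 patch
two_level = [0] * 128 + [255] * 128  # two equally frequent intensities
print(patch_entropy(flat))
print(patch_entropy(two_level))
```

A uniform patch carries no information under this measure, while a patch split evenly between two intensities has entropy log 2.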

Spatial Correlation Channel: For two random variables $X$
and $Y$, their correlation can be formulated as

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}. \qquad (4)$$

Here, spatial correlation is computed at every location between
a local 16 × 16 patch and other patches at a given radius from the
local patch. It represents the similarity between the local patch
and its neighbors. In the spatial correlation saliency map, low
spatial correlation (i.e., low similarity) is a simple measure of
high salience.
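The correlation coefficient between two patches can be sketched as follows, treating the flattened pixel values of each patch as samples of X and Y. The function name `patch_correlation` and the choice to return 0 for a constant patch are assumptions made for this illustration only.

```python
import math


def patch_correlation(x, y):
    """Pearson correlation between two equally sized, flattened patches."""
    n = len(x)
    ex, ey = sum(x) / n, sum(y) / n
    exy = sum(a * b for a, b in zip(x, y)) / n
    # max(0, ...) guards against tiny negative values from rounding
    sx = math.sqrt(max(0.0, sum(a * a for a in x) / n - ex * ex))
    sy = math.sqrt(max(0.0, sum(b * b for b in y) / n - ey * ey))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant patch is treated as uncorrelated
    return (exy - ex * ey) / (sx * sy)


print(patch_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # close to +1
print(patch_correlation([1, 2, 3, 4], [4, 3, 2, 1]))  # close to -1
```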

Junction Channels: In addition to the Orientation channel
described previously, several “junction” channels are created to
further characterize the edge contents of image chips. Taking
into consideration the local edge responses in different directions
over small image patches, four different kinds of junction
channels are created, all included in the junction saliency
map: L-junction, T-junction, X-junction and endpoint. The
L-junction channel is sensitive to “corner” features: it responds
at locations where two edges meet perpendicularly and end at
the intersection point. The T-junction channel responds when
two edges are perpendicular and only one of them ends at the
intersection point. Likewise, the X-junction differs from the
T-junction in that neither edge ends at the
intersection point. Finally, the Endpoint channel responds when
an extended edge ends. All junction channels are computed
using a common framework which considers the collection of
edge responses from the four maps in the Orientation channel,
at points neighboring the point of interest.

We consider the 8-neighborhood of a given point of interest
(at a given scale between 0 and 8 in our pyramid framework)
and, at each of the eight neighbors, the one of the four
orientation responses which is along the line segment from the
central point to the neighbor (e.g., at the neighbor above the
central point, the vertical orientation response is considered; at the
neighbor to the left of the central point, the horizontal
orientation response is considered). The response characteristics of
a given junction channel are then given by a disjunction (sum)
of binary response patterns (binary filter masks) applied to the
neighbors’ responses. For example, the T-junction detector will
respond to 1) for an upright T, responses to the left (from
the orientation channel for horizontal orientation), right
(horizontal orientation), and below (vertical orientation) the point of
interest, plus 2) for a T rotated 90° clockwise, responses above,
below, and to the left, plus 3) for an upside-down T, responses
above, left and right, plus 4) for a T rotated 90° counter-clockwise,
responses above, below and to the right. The L-junction
and X-junction channels are defined likewise, and the mask
pattern for the endpoint channel is simpler, as it simply requires
that an orientation response exists on one side of the point of
interest but not on the other (for example, some vertical response
above but none below).
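The mask logic can be illustrated with a deliberately simplified sketch. It assumes already-thresholded (binary) edge responses, looks only at the four axis-aligned neighbors (ignoring the diagonal neighbors and the 45° and 135° orientations), and returns a single label rather than one response per channel. The function name `classify_junction` is hypothetical.

```python
def classify_junction(up, down, left, right):
    """Label a point from binary edge responses at its four axis-aligned neighbors."""
    count = sum((up, down, left, right))
    if count == 4:
        return "X-junction"      # both edges continue through the point
    if count == 3:
        return "T-junction"      # any of the four rotations of a T
    if count == 2:
        if (up or down) and (left or right):
            return "L-junction"  # two perpendicular edges ending at a corner
        return None              # a straight edge passing through the point
    if count == 1:
        return "endpoint"        # response on one side only
    return None


# Upright T: responses to the left, to the right, and below.
print(classify_junction(up=False, down=True, left=True, right=True))
```

The four possible three-out-of-four patterns correspond to the four T rotations enumerated in the text, which is why a simple count suffices in this reduced setting.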

Surprise  Channel:  We  recently  proposed  an  enhanced 
saliency  model,  which  exploits  a  new  Bayesian  definition  of 
surprise  to  predict  human  perceptual  salience  in  space  and 
time  [55]-[57].  Very  briefly,  surprise  quantifies  the  difference 
between  prior  and  posterior  beliefs  of  an  observer  as  new 
data  is  observed.  If  observing  new  data  causes  the  observer  to 
significantly  reevaluate  his/her/its  beliefs  about  the  world,  that 
observation  will  cause  high  surprise.  Surprise  complements 
Shannon’s  definition  of  information  by  emphasizing  the  effect 
of  data  observations  onto  the  internal  subjective  beliefs  of  an 
observer, while Shannon information objectively characterizes
the data itself (in terms of, e.g., how costly it would be to
transmit  from  one  point  to  another).  Here,  we  use  this  new 
model  as  well,  though  we  only  consider  the  spatial  domain 
since  all  images  are  static.  Surprise  is  then  computed  for  each 
16  X  16  image  patch  by  establishing  prior  beliefs  from  a  large 
neighborhood  of  image  patches,  and  computing  the  extent 
to  which  such  beliefs  are  adjusted  into  posterior  beliefs  after 
information  about  the  central  patch  of  interest  is  observed. 
The  surprise  map  computed  under  these  conditions  is  similar 
to  a  regular  saliency  map,  except  that  the  Bayesian  surprise 
computations  are  used  for  competition  across  space  instead 
of  the  mechanism  described  in  the  following.  The  surprise 
map  is,  thus,  an  optimized  weighted  combination  of  intensity, 
orientation  and  junction  features,  to  which  a  spatial  surprise 
detector  is  applied. 
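The idea of surprise as a distance between prior and posterior beliefs can be illustrated on a toy discrete hypothesis space. This sketch assumes the Kullback-Leibler divergence from prior to posterior as the measure of difference; the two-hypothesis belief model and the function names are hypothetical and much simpler than the model used for the surprise channel.

```python
import math


def bayes_update(prior, likelihood):
    """Posterior over hypotheses given the likelihood of the new data under each."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def surprise(prior, posterior):
    """KL divergence KL(posterior || prior), in nats."""
    return sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0)


prior = [0.5, 0.5]
boring = bayes_update(prior, [0.3, 0.3])       # data equally likely under both
informative = bayes_update(prior, [0.9, 0.1])  # data favors the first hypothesis
print(surprise(prior, boring))       # approximately 0: beliefs unchanged
print(surprise(prior, informative))  # positive: beliefs shifted
```

Data that leaves the beliefs unchanged yields no surprise, however improbable the data itself may be, which is the contrast with Shannon information drawn in the text.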

Feature Maps Competition: In all maps except surprise
(which has its own internal competition dynamics), a feature
map competition mechanism tends to globally promote maps in
which a small number of strong peaks of conspicuous locations
is present, while globally suppressing maps which contain
numerous comparable peak responses. To implement this, we first
normalize the feature map to a fixed range $[0 \ldots M]$, then
find the global maximum value $M$ and the average value $\bar{m}$ of
all other local maxima, and finally multiply the map globally by
$(M - \bar{m})^2$, as was previously described in detail [25].
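The three normalization steps can be sketched on a small list-of-lists map. This is an assumption-laden illustration, not the implementation of [25]: local maxima are taken to be strict maxima over the 8-neighborhood, M defaults to 1, and the names `local_peaks` and `compete` are hypothetical.

```python
def local_peaks(m):
    """Values of strict local maxima over the 8-neighborhood."""
    h, w = len(m), len(m[0])
    peaks = []
    for i in range(h):
        for j in range(w):
            nbrs = [m[a][b]
                    for a in range(max(0, i - 1), min(h, i + 2))
                    for b in range(max(0, j - 1), min(w, j + 2))
                    if (a, b) != (i, j)]
            if all(m[i][j] > n for n in nbrs):
                peaks.append(m[i][j])
    return peaks


def compete(fmap, M=1.0):
    """Rescale to [0, M], then weight the map by (M - mean of other peaks)^2."""
    lo = min(min(row) for row in fmap)
    hi = max(max(row) for row in fmap)
    if hi == lo:                      # flat map: nothing stands out
        return [[0.0] * len(row) for row in fmap]
    scaled = [[M * ((v - lo) / (hi - lo)) for v in row] for row in fmap]
    peaks = local_peaks(scaled)
    if M in peaks:
        peaks.remove(M)               # drop one instance of the global maximum
    m_bar = sum(peaks) / len(peaks) if peaks else 0.0
    weight = (M - m_bar) ** 2
    return [[v * weight for v in row] for row in scaled]


one_peak = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
two_peaks = [[0, 0, 0, 0, 0], [0, 5, 0, 5, 0], [0, 0, 0, 0, 0]]
print(compete(one_peak)[1][1])              # single peak is kept at full strength
print(max(max(r) for r in compete(two_peaks)))  # equal peaks cancel out
```

A map with one dominant peak passes through unchanged, whereas a map with several comparable peaks is suppressed, which is the behavior the paragraph describes.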

Saliency Map Feature Extractor: For each of the ten variants
of the model, the obtained saliency map is relatively high-dimensional
data (for example, a 512 × 512 image chip’s saliency
map size is 32 × 32 = 1024D), and this becomes especially true
when all ten channels’ saliency maps are combined. To reduce
the data dimensionality while keeping the most important
information, we compute four summary statistic values to represent
each saliency map: mean value $\mu_k$, standard deviation $\sigma_k$ over
the saliency map’s pixels, number of local maxima (peaks) $n_k$ in
the map, and the average Euclidean distance $d_k$ between the local
maximum points. The computation formulas are

$$\mu_k = \frac{1}{W \times H} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} SM_k(i,j) \qquad (5)$$

$$\sigma_k = \sqrt{\frac{1}{S_z - 1} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \left( SM_k(i,j) - \mu_k \right)^2} \qquad (6)$$

$$d_k = \mathrm{mean}\left( \sqrt{(i_p - i_q)^2 + (j_p - j_q)^2} \right), \quad p, q \le n_k,\ p \ne q \qquad (7)$$
where $W$ and $H$ are the saliency map width and height, $S_z$ is the
saliency map’s area, $(i_p, j_p)$ and $(i_q, j_q)$ are local maximum points
in the saliency map, and the subscript $k$ indicates the saliency map
type (intensity, orientation, ...). The rationale is that
the saliency map describes the conspicuity of the image and
only the most salient points or regions show up on it;
we can therefore use these four values to represent the
most important information of the saliency map. We may lose
the salient objects’ position information; however, we hypothesize
that this might not greatly affect the performance of the detection
task: the four statistics should capture some information
about the distribution of salient objects in the image chip, no
matter where they are, and may serve as a useful position-invariant
(and somewhat rotation- and scale-invariant) descriptor
of the image chip. Our experiments shown in the following will
directly test this hypothesis. According to the previous
analysis, the dimension of the combined saliency feature vector
is $\mathrm{Dim}_{sal} = N_{\text{feature channels}} \times 4 = 10 \times 4 = 40$.

Fig. 3. Block diagram of gist features computation model applied to every
image chip.
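The four summary statistics of one saliency map can be sketched as below. The sketch assumes strict local maxima over the 8-neighborhood as the peak definition and a standard deviation with denominator equal to the map area minus one; neither detail is confirmed by the surviving text, and the names `peak_locations` and `saliency_summary` are hypothetical.

```python
import math
from itertools import combinations


def peak_locations(sm):
    """(row, col) positions of strict local maxima over the 8-neighborhood."""
    h, w = len(sm), len(sm[0])
    return [(i, j) for i in range(h) for j in range(w)
            if all(sm[i][j] > sm[a][b]
                   for a in range(max(0, i - 1), min(h, i + 2))
                   for b in range(max(0, j - 1), min(w, j + 2))
                   if (a, b) != (i, j))]


def saliency_summary(sm):
    """Return (mean, standard deviation, number of peaks, mean peak distance)."""
    values = [v for row in sm for v in row]
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    peaks = peak_locations(sm)
    dists = [math.dist(p, q) for p, q in combinations(peaks, 2)]
    mean_dist = sum(dists) / len(dists) if dists else 0.0
    return mean, std, len(peaks), mean_dist


sm = [[0.0] * 5 for _ in range(5)]
sm[0][0] = 1.0
sm[3][4] = 1.0
summary = saliency_summary(sm)
print(summary)  # (mean, std, number of peaks, mean distance between peaks)
```

Averaging over unordered pairs gives the same value as averaging over all ordered pairs with distinct indices, so the pairwise loop matches the mean in the distance formula.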

B.  Gist  Feature  Computation 

The gist feature computation model [50] is related to the
saliency computation model, except that it embodies concepts
of feature cooperation across space rather than competition.
The gist computation model architecture used in the present
paper is shown in Fig. 3. Its low-level feature channels are
intensity, four orientations (0°, 45°, 90°, and 135°),
four L-junctions (0°, 45°, 90°, and 135°), four T-junctions
(0°, 45°, 90°, and 135°), four endpoints (0°, 45°, 90°, and
135°) and one X-junction, for a total of 18 feature channels.

Unlike the saliency feature extraction model, both center-surround
and raw (before center-surround) pyramid levels are
exploited. For the center-surround operation, six center-surround
difference maps are computed within each pyramid
as point-to-point differences across pyramid scales, for each
combination of three center scales ($c \in \{2, 3, 4\}$) and two
center-surround scale differences ($\delta \in \{3, 4\}$). For the raw
operation, the adopted raw pyramid scales range from 0 to 4.

Since gist features describe an image chip’s overall information,
we only use the mean value to represent each of the gist feature
maps:

$$G_{k,s,c} = \frac{1}{W \times H} \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} GF_{k,s,c}(i,j) \qquad (8)$$

where $W$ and $H$ are the gist feature map width and height, and
indices $k$, $s$, $c$ denote the feature map type, scale, and
center-surround type, respectively. Therefore, the gist feature vector
dimension is $\mathrm{Dim}_{gist} = N_{\text{feature channels}} \times (N_{\text{center scales}} \times
N_{\text{surround scales}} + N_{\text{no-CS scales}}) = 18 \times (3 \times 2 + 5) = 198$.

We simply concatenate the saliency features and gist features
to form the final saliency-gist feature vector, which has
40 + 198 = 238 dimensions. One example of the complete
process for one input image is illustrated in Fig. 4. Before
using these feature vectors to detect targets, it is necessary to
normalize the feature values along feature types. The normalized
features can then be sent to the classifier to implement the
detection task. Considering the high nonlinearity of the feature
vectors’ distribution, an RBF (radial basis function) based SVM was
adopted to complete the classification task. In this paper, the SVM
implementation provided by [52] was adopted for its ease of use.
Furthermore, for the normalized input, the parameters of the SVM
can be optimized automatically and no tuning is needed.

Fig. 4. Example of complete saliency-gist feature extraction for an image chip.
Note that the saliency maps shown already have been subjected to spatial
competition; hence, for example, out of the initially many responses in the
T-junction channel at various locations and for various spatial scales, one ends
up winning the competition strongly and dominating the other ones in the
particular example image chip shown. The gist feature map examples shown in
this figure are presented in pairs, for the center-surround and raw
no-center-surround computations. In each pair, the left map is the
center-surround result while the right map is the no-center-surround result.
There are four pairs shown in this figure: intensity, 45° orientation, 135°
L-junction and endpoint. The scales of center surround are 2 (center) and 5
(surround) while the scale of no center-surround is 2.
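The dimension bookkeeping and a per-feature normalization step can be sketched as follows. The text does not specify how features are normalized, so the linear rescaling of each feature to [0, 1] shown here is only one common choice and is an assumption, as is the name `scale_features`.

```python
N_SALIENCY = 10 * 4           # ten saliency channels x four summary statistics
N_GIST = 18 * (3 * 2 + 5)     # 18 channels x (6 center-surround + 5 raw maps)
N_TOTAL = N_SALIENCY + N_GIST
print(N_SALIENCY, N_GIST, N_TOTAL)  # 40 198 238


def scale_features(rows):
    """Linearly rescale every feature (column) to [0, 1] over the given samples."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]


# Three toy samples with two features each (real vectors would have 238).
scaled = scale_features([[0, 10], [5, 10], [10, 10]])
print(scaled)
```

In practice the scaling parameters would be estimated on the training chips and then reused unchanged on the test chips.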

III.  Experiments  and  Results 

We test the proposed model with four experiments of challenging
broad-area search in satellite images. The search
tasks are challenging mainly because of high intraclass variability
in the target category: boats in experiments 1 and 2 (from small
vessels to large ships), buildings in experiment 3, and airplanes in
experiment 4. To compare our algorithm to the state of the art,
we employ HMAX [14], [18], SIFT [7], and the
hidden scale salient structure object detection algorithm [16] as
references. We opted for HMAX and SIFT because of their
popularity in target detection and in generalization over object
categories from limited training data. The hidden scale salient
structure method is similar to our research and performs very well in
target detection for satellite images. Source code is available for
all of these reference methods, which makes them easy to use in our
experiments. To complement our analysis, we also compare our
algorithm to Siagian-Itti’s gist features proposed in [50] to show how
much is gained from our very simple 4D summaries of saliency
maps and from the new gist features used here.

Fig. 5. Examples of target image chips and non-target image chips for
experiment 1, detecting image chips which contain one or more boat(s) of any
size and type. The top-row image chips include one or more target boats, while
the bottom-row image chips do not include any target.


A.  Experiment  1 

The first dataset (dataset 1) used to test the proposed model
includes 14 416 image chips (500 × 500) which were cut out of
one large broad-area satellite image (size 21 500 × 27 500) with
a sliding-window step size of 200 pixels (hence, two successive
chips overlap by 300 pixels). All target centers in the broad-area
image were manually labeled as ground truth (if several boats
were connected together, then we treated them as one target);
the boats’ sizes ranged from tens of pixels to hundreds of pixels.
Among these image chips, 705 included targets (various boats).
Examples of target image chips and nontarget image chips can
be seen in Fig. 5. To compare the effectiveness of the proposed
saliency-gist approach to the state of the art, we compare it with
the gist feature proposed in [50] (here called the standard gist
feature), the HMAX feature [14], [18], the SIFT feature [7] and
the hidden scale salient structure feature [16].
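The reported chip count can be checked with a short calculation. The sketch assumes that only chips lying fully inside the image are kept (an assumption about border handling); the function name `count_chips` is hypothetical.

```python
def count_chips(width, height, chip=500, step=200):
    """Number of chip positions for a sliding window that stays inside the image."""
    nx = (width - chip) // step + 1
    ny = (height - chip) // step + 1
    return nx * ny


print(count_chips(21500, 27500))  # 14416
```

With these assumptions there are 106 × 136 window positions, which agrees with the 14 416 chips stated for dataset 1.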

In the classification step, $N$ positive image chips (which
include one or more targets) and $N$ negative image chips (which
do not include any target) are randomly selected from the dataset
and used as the training samples, while all remaining image
chips are treated as test data. Commonly used measures
of classification accuracy are the percentages of true
positives (TP) and true negatives (TN), which are defined as

$$\mathrm{TPR} = \frac{TP}{TP + FN} \times 100\% \qquad (9)$$

$$\mathrm{TNR} = \frac{TN}{TN + FP} \times 100\% \qquad (10)$$

where TPR and TNR stand for true positive ratio and true
negative ratio, and FN and FP denote the numbers of false negatives
and false positives. The classification results with different numbers
of training samples are shown in Fig. 6. The classification rate
improves as the number of training samples increases. It is worth
noting that, even with a small number of training samples, the
results do not catastrophically degrade but rather remain quite
high (above 80% hits and correct rejections).

Fig. 6. Classification results for experiment 1 (detecting boats), for different
numbers of training images from the pool of 705 total available chips containing
one or more targets (error bars are computed from 100 runs for each number of
training examples, selecting the examples randomly for each run). Horizontal
axis: number of training images.
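The two ratios can be computed from binary labels and predictions as in the following sketch, where 1 marks a chip with a target and 0 a chip without. The function name `tpr_tnr` and the toy data are illustrative only.

```python
def tpr_tnr(labels, predictions):
    """Return (TPR, TNR) in percent from binary ground truth and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)


labels = [1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 1]
tpr, tnr = tpr_tnr(labels, predictions)
print(tpr, tnr)  # hits and correct rejections, in percent
```

Reporting both ratios matters here because targets are rare: in dataset 1 only 705 of 14 416 chips contain a boat, so overall accuracy alone would be dominated by the negatives.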

For a classification system, pursuing a higher TPR and a lower
false positive ratio (FPR) usually conflict: a higher
TPR often comes with a higher FPR. With different decision criteria,
the classification results may vary. For example, in a warning
system, pursuing a higher TPR is preferred to pursuing a lower
FPR. Since the receiver operating characteristic (ROC) curve
shows how TPR and FPR trade off as the
classification decision criterion changes, it is widely adopted
to compare the performance of different classification systems.
A higher TPR at a low FPR indicates a better classification
system, and this is usually summarized by the area under the
ROC curve. An ROC area equal to 1 means a system that can
perfectly classify the categories without any error, an ROC area
equal to 0.5 corresponds to a random classification system, and the
larger the ROC area, the better the classification performance. To
compute ROC curves with our algorithm, we systematically vary
the distance to the decision boundary as the criterion parameter.
Fig. 7 shows ROC curves for the proposed saliency-gist algorithm,
as a function of the number of training examples. We
can see that performance degrades gracefully as the number of
training examples is decreased. The corresponding ROC curves
and ROC areas of classification with the saliency-gist feature,
HMAX feature, SIFT feature, hidden scale salient structure feature
and standard gist features are shown in Fig. 8 (marked as
sal-gist, HMAX, SIFT, sal-structure, and gist-std in the figure,
respectively). It is clear from the figure that the saliency-gist
feature outperforms the other features greatly (t-tests on the 100
ROC values obtained with each of the 100 randomly selected
training sets, p < 10“^^ or better), hence demonstrating the
appeal of the proposed approach. Also, from the figure we can see
that the hidden scale salient structure method almost failed in
this experiment. This is mainly because the targets (boats) are
not salient compared with many inland buildings when using
the salient structure algorithm in [16] and, thus, the algorithm
misclassified many buildings as boat targets.

Fig. 7. ROC curve for the proposed system (zoomed in on the horizontal axis)
for different numbers of training samples, for experiment 1 (detecting boats).
The corresponding ROC area values and standard deviations are labeled in the
legend. (Standard deviations are computed from 100 runs for each number of
training examples, selecting the examples randomly for each run.)

Fig. 8. ROC curve comparison among different feature types in experiment
1, detecting boats. 300 training samples were used for both the positive and
negative target categories. The mean ROC area values (corresponding to the
thick curves) and standard deviations are labeled in the legend. (Standard
deviations are computed from 100 runs for the saliency-gist feature, standard
gist feature, SIFT feature and hidden scale salient structure feature, and from
a smaller number of ten runs for the HMAX feature because of the high run-time
of HMAX.) The shadow envelopes and ten thin curves for each model show the
ROC curves which reach the maximum and minimum ROC area in the multiple
runs of the experiment (using different randomly chosen training samples from
the training set). ROC performance for the proposed Sal-Gist algorithm is
significantly better than for all other methods.

Fig. 9. ROC curve (zoomed in on the horizontal axis) for different numbers
of training samples, for experiment 2 (detecting boats, with training set from
experiment 1). The corresponding ROC area values and standard deviations are
labeled in the legend. (Standard deviations are computed from 100 runs for each
number of training examples, selecting the examples randomly for each run.)

Fig. 10. ROC curve comparison among different feature types in experiment 2,
detecting boats. 300 training samples were used for both the positive and
negative target categories (from experiment 1’s dataset). The corresponding mean
ROC area values and standard deviations are labeled in the legend (100 runs
for the saliency-gist feature, standard gist feature and SIFT feature, ten runs
for the HMAX feature because of its high complexity). The shadow contours stand
for the ROC curves which reach the maximum and minimum ROC area in multiple
experiment runs.

Fig. 11. Examples of target images and non-target images for experiment 3,
finding buildings of any type, size, and style. Top-row images include one or
more target(s) while the bottom-row images do not include any target.
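Sweeping a threshold over classifier decision values and integrating the resulting curve can be sketched as below. The decision values stand in for signed distances to an SVM decision boundary; the trapezoidal integration and the function names `roc_points` and `roc_area` are assumptions of this illustration rather than details taken from the experiments.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering a threshold over the decision values."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((fp / neg, tp / pos))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 0]
separable = [2.0, 1.0, -1.0, -2.0]    # positives score higher than negatives
uninformative = [0.0, 0.0, 0.0, 0.0]  # every chip gets the same score
print(roc_area(roc_points(labels, separable)))      # 1.0
print(roc_area(roc_points(labels, uninformative)))  # 0.5
```

The two toy cases reproduce the reference points named in the text: an area of 1 for error-free classification and 0.5 for a classifier that cannot rank the chips at all.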

B.  Experiment  2 

This experiment tests how training on one broad-area image
taken at one given time and location may generalize to testing
on another broad-area image taken at another time and location.
The second dataset (dataset 2) includes 11 385 image chips
(500 × 500) which were cut out of another large broad-area
satellite image (size 23 300 × 20 100, taken from the same
country but on a different date and at a different place than the
broad-area image of experiment 1), with the same sliding-window
step size as in dataset 1. We labeled the targets manually as ground
truth, as in experiment 1; 1 049 image chips
include one or more target(s). In this experiment, training
samples for the classifier are randomly selected from dataset 1,
while all the image chips in dataset 2 are used as the test set.

Fig. 9 shows that ROC performance improves with the
number of training samples, as in experiment 1. With 300
training samples, ROC area was 0.952 here, as compared to
0.977 in experiment 1 (Fig. 8), suggesting good generalization
capability to new, never seen images. As in experiment 1, we
compare detection results with the saliency-gist feature,
standard gist features, HMAX features, and SIFT features
(the hidden scale salient structure feature is not included in
this comparison due to its poor performance
in experiment 1). The saliency-gist feature
performs much better than the other three features (Fig. 10, t-tests,
p < 10“^^ or better).

Fig. 12. ROC curves (zoomed in on the horizontal axis) for different numbers
of training samples in experiment 3 (detecting buildings). The corresponding
ROC area values and standard deviations are labeled in the legend. (Standard
deviations are computed from 100 runs for each number of training examples,
selecting the examples randomly for each run.)

C.  Experiment  3 

In  this  experiment,  targets  are  simply  defined  as  “buildings” 
in  satellite  images.  This  experiment,  thus,  tests  the  ability  of  our 
same  algorithm  to  classify  very  different  types  of  targets;  the  in¬ 
traclass  variability  here  is  also  arguably  even  larger  than  in  ex¬ 
periments  1  and  2  (see  Fig.  11).  The  dataset  (dataset  3)  used  here 
includes  108  885  image  chips  (this  experiment  used  a  smaller 
chip  size  of  256  x  256  because  the  targets  were  also  smaller 
than  in  experiments  1  and  2)  with  6  323  of  them  being  posi¬ 
tive  examples.  Fig.  11  shows  examples  of  buildings  and  neg¬ 
ative  examples.  Like  in  experiments  1  and  2,  the  image  chips 
were  cut  from  a  broad- area  satellite  image  (size  16  512  x  27 
520,  taken  from  a  different  country  and  a  different  year  than 
the  images  of  experiments  1  and  2).  The  slide  window  size  here 
was  64  pixels.  Ground-truth  information  for  this  dataset  (loca¬ 
tions  of  buildings)  was  provided  to  us  by  an  outside  corporation. 
The ROC curves for different numbers of training samples are
plotted in Fig. 12. As we can see, performance again improves
with the size of the training set. Again, we compare the detection
results with the standard gist features, the HMAX features,
the SIFT features and the hidden scale salient structure features.
The corresponding ROC curves of classification with these feature
types are shown in Fig. 13. It is clear from the figure that
the saliency-gist features again greatly outperform the other
features (t-tests, p < 10^(…) or better).

Fig. 13. ROC curve comparison among different feature types in experiment 3,
detecting buildings. 300 positive and 300 negative training examples were used.
The experiment parameters are the same as in experiments 1 and 2. The shadow
contours of the SIFT and hidden scale salient structure features are quite small
and cannot be seen in this figure.

To illustrate the detection results in a more straightforward and
global way, we adopt a probability map (PM) representation.
A probability map is a matrix that gives, for each image chip,
the probability that the chip contains a target.
The rescaled broad-area satellite image and some example target
buildings are shown in Fig. 14(a), and the corresponding probability
map is shown in Fig. 14(b); the red points in the images
mark the center locations of the labeled targets. This simple
representation reinforces the ROC results and suggests high
performance of the algorithm, as shown by the overlap between red
ground-truth locations and brighter locations in the PM (higher
probability of target according to our algorithm). During search
for buildings, exploring the image in decreasing order of target
probability per our algorithm would isolate more targets faster
than a naive scan from left to right and top to bottom.
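
The benefit of exploring chips in decreasing order of probability can be illustrated with a small sketch. The grid size, the probability values and the ground-truth labels below are invented for illustration and are not taken from dataset 3; the helper name `targets_found_curve` is likewise hypothetical.

```python
import numpy as np

def targets_found_curve(order, is_target):
    """Cumulative number of targets found after inspecting chips in `order`."""
    return np.cumsum(is_target.ravel()[order])

# Hypothetical 4 x 5 probability map (one value per image chip, invented numbers).
prob_map = np.array([
    [0.05, 0.10, 0.80, 0.20, 0.05],
    [0.10, 0.15, 0.90, 0.30, 0.10],
    [0.05, 0.05, 0.20, 0.10, 0.70],
    [0.02, 0.05, 0.10, 0.05, 0.60],
])

# Hypothetical ground truth: True where a chip contains a labeled building.
is_target = np.zeros(prob_map.shape, dtype=bool)
is_target[0, 2] = is_target[1, 2] = is_target[2, 4] = True

# Naive scan: left to right, top to bottom.
raster_order = np.arange(prob_map.size)
# Ranked scan: chips sorted by decreasing probability.
ranked_order = np.argsort(-prob_map.ravel(), kind="stable")

raster_curve = targets_found_curve(raster_order, is_target)
ranked_curve = targets_found_curve(ranked_order, is_target)

n_targets = int(is_target.sum())
chips_needed_raster = int(np.argmax(raster_curve == n_targets)) + 1
chips_needed_ranked = int(np.argmax(ranked_curve == n_targets)) + 1
print(chips_needed_raster, chips_needed_ranked)
```

In this toy case the ranked scan reaches all three targets after three inspections, whereas the raster scan needs fifteen; the gap on real data depends on how well the probabilities separate targets from background.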

D.  Experiment  4 

An aerial image of an airport is used in this experiment
to detect “airplanes” (see Fig. 15). The dataset (dataset 4)
used here includes 2 601 image chips, 1 382 of which include
a target. Each chip is 64 x 64 because of the small
target size. Compared to the previous experiments, the target
is relatively easy to detect because intraclass variance (both
in shape and area) is small. Here we compared the detection
performance of the saliency-gist feature, the hidden scale salient
structure feature and SIFT. Ten positive and ten negative examples
were randomly selected as training data while the rest were
taken as test data. The detection results from the different methods
are plotted in Fig. 16 (100 runs for each). From the figure we
can see that all three methods perform very well, while the proposed
method performs even better than the others (no shadow
contours are plotted here because the differences between the
methods are small while the variance of the SIFT results is
relatively large, which would make the figure unclear).

E.  Saliency  Versus  Gist 

As saliency-gist features yield strong classification results, it
is interesting to examine the separate contributions of the saliency
features and gist features. The ROC areas obtained using saliency
features only, gist features only (in our new implementation, which
includes more feature channels than the older Gist-Std model),
and combined saliency-gist features in all four experiments are
shown in Table I. It can be seen from the table that the combined
saliency-gist features outperform both saliency features
and gist features in all experiments. Hence, these results show
that the saliency features and gist features are not fully redundant,
even though they are computed using similar low-level
feature detectors. In addition, the table shows that saliency features
perform better in experiments 1, 2 and 4, while gist features
perform better in experiment 3. Thus, in different cases, the
classification results depend more on different types of information
(saliency information or gist information), which again
reinforces the benefit of using both types of features.

IV.  Discussion 

Our results show that the proposed algorithm performs better
than the state of the art (HMAX algorithm, SIFT algorithm,
hidden scale salient structure algorithm and previously proposed
gist algorithm alone) in difficult target search scenarios.
This was achieved in situations where targets can vary greatly
in size, shape, and number per chip. Overall,
the proposed algorithm is conceptually very simple and at the
same time very general, since the feature extraction stages
were not designed or tuned for the specific types of images
and targets tested here. Taken together, the results suggest that
the proposed system may be applicable to a wide range
of images and target types. Indeed, nothing in the proposed
algorithm has been specifically developed or tuned for the boat,
building or airplane targets tested here, or for the type of
images processed in our experiments.

The success of the proposed approach may be due to our
use of two complementary sets of biologically-inspired features:
gist features largely discard spatial information, while saliency
features summarize it. In the human brain, it is clear that object
recognition relies on being able to compute invariants, but at the
same time pose parameters are not lost: although one recognizes
an upside-down face as being a face, one is also aware that it is
upside-down. Our approach here seems to benefit from this dual
view of the image data. Recently, some other biologically-inspired
feature extraction methods [19] have started to use
“gestalt” information (continuity, symmetry, closure, repetition,
etc.) for object detection and have shown promising results.
It is likely that combining these feature types would yield
even better detection performance. Many other feature
types could also be added to our approach, including
for example local binary pattern (LBP) features, which have
been particularly successful in texture segmentation [58].

Fig. 14. Illustration of “building” detection in experiment 3. (a) Rescaled broad-area satellite image (16 512 x 27 520 pixels) and some target examples. (b) Probability
map of (a) computed by our algorithm. The red points are the true target center locations. In the probability map, lighter areas indicate higher probability of
targets, while darker areas denote lower probability of targets according to the algorithm.


TABLE I

Comparison of ROC Areas of Different Types of Features in Four Experiments

               Saliency Feature   Gist Feature    Saliency-Gist Feature
Experiment 1   0.969 ± 0.003      0.943 ± 0.008   0.977 ± 0.003
Experiment 2   0.945 ± 0.007      0.903 ± 0.009   0.952 ± 0.005
Experiment 3   0.789 ± 0.007      0.905 ± 0.005   0.922 ± 0.005
Experiment 4   0.927 ± 0.031      0.942 ± 0.028   0.976 ± 0.003

The proposed algorithm does not use any complex procedure
to combine the extracted features, although many research
studies have proposed feature combination algorithms to improve
classification performance [59], [60]. Here we only show
that gist features and saliency features are complementary
and that their combination can achieve good performance in target
detection. It is interesting that saliency and gist features both
contribute significantly to performance, and are not fully redundant
(Table I). This suggests a new use of saliency algorithms, for
classification of images based on their saliency maps, as opposed
to using the saliency maps to generate shifts of attention.
It is interesting to consider whether humans and other animals may
use this as well. It is possible that human saliency maps in posterior
parietal cortex, the pulvinar nucleus, the frontal eye fields,
or the superior colliculus [31] may also be analyzed in a holistic
fashion and may contribute to the very rapid understanding of
the rough layout of the scene. That is, the coarse structure of
saliency maps may combine with the broad semantic information
provided by the gist features to yield a coarse and rapid
understanding of both a scene’s gist and layout [61].

Fig. 15. Image used to detect the airplanes in experiment 4.

Fig. 16. ROC curve comparison among different feature types in experiment 4,
detecting airplanes. Ten positive and ten negative training examples were used.
* indicates statistically different ROC performance (t-test, p < 10^(…) or better).

Our approach reinforces the idea, as shown by recent successes
in the domains of statistical machine translation of text
into foreign languages or of speech analysis [62], that relatively
shallow statistical analysis of large datasets can yield surprisingly
good classification and recognition results. Indeed, our
algorithm does not try to understand the geometric structure
or other specific high-level or cognitive features of targets (e.g.,
buildings should have walls, tend to be rectangular, etc.) and is not
attempting recognition by components (breaking down target
objects into elementary parts and their spatial arrangements) [63].

The proposed algorithm is mostly intended as a front-end, to
be used to perform coarse preliminary analysis of large complex
scenes. The data returned are certainly still far from representing
a complete understanding of the scene’s contents. However, our
algorithm’s output can be used in at least two practical ways. First,
it can be used to compute statistics at the region level, e.g., finding areas
in the world with high concentrations of boats, or determining
which regions in a country have more buildings and, hence,
may be more densely populated. Such basic statistics may be of
great use on their own, for example when planning rescue efforts
following a natural disaster, or may assist a human image analyst
in performing deeper and more cognitively-driven surveys of
imagery. Second, our algorithm can be used to rank image chips
by interest (using the probability maps of Fig. 14), so as to focus
limited resources onto the most promising image locations.
Resources may be limited because of limited human personnel,
human viewing time (e.g., when using rapid serial visual presentation
of image chips [64]), or computation time (e.g., using a more
sophisticated and time-consuming object recognition back-end
to validate high-probability chips). It is likely that our system
could perform even better if one were to apply some of the
recognition-by-components principles or another recognition back-end
to the high-probability target chips returned by our algorithm.

Thus far, our algorithm has only been applied to greyscale visible
imagery. With the increasing popularity of color and multispectral
imagery, it remains to be tested in future work whether
our simple approach will scale up to a larger number of spectral
bands. All source code for our algorithms is available on
the authors’ web site (http://iLab.usc.edu).

Abstract

We  present  a  computationally  efficient  model  for  detecting  salient  regions  in  an  image  frame. 

The  model  when  implemented  on  a  portable,  wearable  system  can  be  used  in  conjunction  with 
a  retinal  prosthesis,  to  identify  important  objects  that  a  retinal  prosthesis  patient  may  not  be 
able  to  see  due  to  implant  limitations.  The  model  is  based  on  an  earlier  saliency  detection 
model  but  has  a  reduced  number  of  parallel  streams.  Results  of  a  comparison  between  the 
areas  detected  as  salient  by  the  algorithm  and  areas  gazed  at  by  human  subjects  in  a  set  of 
images  show  a  correspondence  which  is  greater  than  what  would  be  expected  by  chance. 

Initial  results  for  a  comparison  of  the  execution  speed  of  the  two  algorithm  models  for  each 
frame  on  the  TMS320  DM642  Texas  Instruments  Digital  Signal  Processor  suggest  that  the 
proposed  model  is  approximately  ten  times  faster  than  the  original  saliency  model. 

(Some  figures  in  this  article  are  in  colour  only  in  the  electronic  version) 


1.  Introduction 

An  electronic  retinal  prosthesis  is  under  development,  to 
treat  blinding  diseases  like  retinitis  pigmentosa  (RP)  and  age- 
related  macular  degeneration  (AMD)  [1].  In  RP  and  AMD, 
the  photoreceptor  cells  are  affected  while  other  retinal  cells 
remain  relatively  intact.  Photoreceptor  cells  convert  light 
information  entering  the  retina  into  electrical  signals  and  hence 
the  progressive  loss  of  these  cells  leads  to  a  gradual  loss  of 
vision  in  patients.  The  retinal  prosthesis  aims  to  provide 
partial  vision  by  electrically  activating  the  remaining  cells  of 
the  retina.  Current  retinal  prosthesis  prototypes  use  external 
components  to  acquire  and  code  image  data  for  transmission  to 
an  implanted  retinal  stimulator.  The  external  system  consists 
of  a  small  camera  to  capture  video  in  real  time  and  a  portable 
video  processor  to  convert  image  data  to  a  series  of  command 
signals  which  are  wirelessly  transmitted  to  the  implanted 
retinal  stimulator. 

Human  monocular  vision  has  a  field  of  view  close  to 
160°  [2].  Due  to  surgical  limitations  on  implant  size,  current 
retinal prostheses only stimulate the central 15-20° field of
view. Prototype systems range from 16 to 1550 electrodes
[3-6],  which  is  well  below  the  resolution  of  the  retina  in 
this  region,  even  if  every  electrode  can  create  an  independent 
pixel.  If  the  entire  camera  image  (between  40°  and  60° 
field  of  view)  is  compressed  to  fit  the  central  field,  there 
will  be  a  loss  of  resolution  and  miniaturization  of  objects, 
with a likely decrease in the quality of vision. By contrast,
if only the central 15-20° field of view from the camera
image is extracted and stimulated electrically, the visual
information will be more organized and perceivable to the
recipients, but peripheral information will be lost,
severely hampering mobility. Hence, there is a need for a
specific  image  processing  algorithm  which  could  be  used  to 
overcome  the  loss  of  peripheral  information  due  to  the  limited 
field  of  view. 

Retinal  prosthesis  research  can  involve  image  processing 
in  a  number  of  different  ways.  Several  studies  have  conducted 
simulated  vision  experiments  with  normal  sighted  volunteers 
performing  reading  or  mobility  tasks.  The  goal  of  these 
experiments  was  to  test  visual  task  performance  as  a  function 
of pixel number, density and quality. These have been recently
reviewed [7]; collectively, these studies suggest that 600-1000
electrodes are needed for functional vision. Another
facet  of  image  processing  research  related  to  retinal  prostheses 
involves  the  conversion  of  the  image  data  into  a  stimulus 
pattern  that  best  conveys  the  desired  visual  perception.  Asher 
et  al  [8]  propose  real-time  image  processing  algorithms  and 
transformations  to  simulate  the  different  functions  of  the 
various  cell  layers  and  cell  layer  connections  in  the  retina. 
They  also  propose  a  conversion  method  for  transforming  the 
visual  information  into  electrical  current  patterns.  Hallum 
et  al  [9]  also  propose  a  method  for  converting  an  image  frame 
captured  by  the  camera  into  low  resolution  modulated  charge 
injections.  Finally,  image  processing  has  been  proposed  to 
enhance  certain  features  of  an  image  that  may  be  important  to 
the user. Boyle et al [10] examined accentuating certain image
features. One finding from this study suggested the utility of
importance maps, like the ones created by the bottom-up visual
attention saliency model by Itti et al [11-15]. With this in
mind,  we  propose  an  image  processing  algorithm  based  on 
the  saliency  detection  model  by  Itti  et  al  to  find  the  important 
and  salient  regions  in  the  entire  image  frame  and  cue  subjects 
toward  the  direction  of  the  salient  region. 

The  retina  along  with  higher  visual  processes  guides 
visual  attention  in  the  visual  cortex.  Visual  attention  binds 
information  from  multiple  parallel  processes  carrying  motion, 
depth,  color  and  form  information  in  the  visual  cortex 
[16].  During  visual  search  different  information  from  these 
processes  is  combined  first,  and  the  output  guides  the  attention 
deployment  process  [17,  18].  Computational  models  based 
on  saliency  detection  have  been  used  in  computer  vision 
and  robotics  to  predict  important  areas  in  the  visual  field. 
A  first  model  based  on  the  feature  integration  theory  was 
proposed  by  Koch  and  Ullman  [19].  The  model  is  a  bottom- 
up  saliency  detection  model  that  computes  a  saliency  map 
by  combining  several  basic  features  which  undergo  parallel 
processing.  Based  on  this  model,  a  bottom-up  model  based  on 
primate  vision  was  proposed  by  one  of  the  authors  of  this  paper 
[11-15].  This  model  (hereon  referred  to  as  the  ‘full  model’) 
forms  the  basis  of  many  implementations  of  visual  attention 
in  robotics  and  artificial  intelligence  [20]  and  also  forms  the 
basis  of  the  work  proposed  here.  We  propose  an  algorithm 
(hereon  referred  to  as  the  ‘new  model’)  that  is  based  on  the 
full  model,  but  with  simplifications  to  increase  efficiency,  to 
allow  execution  on  a  portable  processor.  In  this  paper,  we 
describe  the  new  model  in  detail  and  verify  that  it  can  predict 
human  gaze  using  a  library  of  images  and  human  observers. 

2.  Methods 

2.1. New model algorithm

The new model (figure 1) uses three information streams: color
saturation, intensity and edge information. These information
streams are extracted by converting the input image from the
RGB color space to the HSI (hue-saturation-intensity) color
space. This conversion can be done in various ways. The
conversion for our algorithm was done using the function
rgb2hsv in Matlab from MathWorks Inc. Nine scales of dyadic
Gaussian pyramids [21] are created for the saturation (S),
intensity (I) and edge (E) information by successively low-pass
filtering and downsampling by a factor of 2. Edge pyramids
are created from the intensity stream based on Laplacian
pyramid generation [22, 23]. For each level of the pyramid, the
edge pyramid image is created as a point-by-point subtraction
between the intensity image at that level and the interpolated
intensity image from the next level.

Figure 1. Diagram of the proposed algorithm.
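
The pyramid construction described above can be sketched as follows. This is a minimal illustration and not the authors' Matlab/Simulink code: the 5-tap binomial low-pass kernel, the edge-replicating border handling and the function names are assumptions. Because each edge level needs the next coarser intensity level, a Gaussian pyramid with n levels yields n - 1 edge levels in this sketch.

```python
import numpy as np

# Assumed low-pass filter: 5-tap binomial kernel (the paper does not specify it).
KERNEL = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0

def blur(img):
    """Separable low-pass filter with edge replication at the borders."""
    padded = np.pad(img, 2, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, KERNEL, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, KERNEL, mode="valid"), 0, rows)

def gaussian_pyramid(img, levels=9):
    """Dyadic pyramid: low-pass filter, then keep every second pixel."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(blur(pyr[-1])[::2, ::2])
    return pyr

def resize_bilinear(img, shape):
    """Bilinear interpolation of `img` onto a grid of the given shape."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, shape[0])
    xs = np.linspace(0, w - 1, shape[1])
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def edge_pyramid(intensity_pyr):
    """Laplacian-style edges: level l minus the interpolated level l + 1."""
    return [
        fine - resize_bilinear(coarse, fine.shape)
        for fine, coarse in zip(intensity_pyr[:-1], intensity_pyr[1:])
    ]

# Usage on a synthetic intensity image (random values, for illustration only).
intensity = np.random.default_rng(0).random((64, 64))
i_pyr = gaussian_pyramid(intensity, levels=5)
e_pyr = edge_pyramid(i_pyr)
print([level.shape for level in i_pyr])
```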

Center-surround mechanisms observed in the visual
receptive fields of the primate retina are then implemented
computationally to create feature maps for each information
stream. Center-surround interactions are modeled as the
difference between the coarse and fine scales of the pyramids
[11-14]. Feature maps are created from only four scales with
the center scales ‘c’ at levels (3, 4) and surround scales ‘s’
at levels (6, 7) where the original image is at level 0 of the
pyramid. For $c \in \{3, 4\}$ and $s = c + \delta$ where $\delta \in \{3, 4\}$
and $s < 8$, a set of three feature maps is created for each
stream. Point-by-point subtraction between the values of the
pyramids at the finer and coarser scales is carried out after
interpolating the coarser scale to the finer scale using bilinear
interpolation. Absolute values of the subtraction are calculated
for the saturation and intensity streams and all feature maps
are decimated by dropping the appropriate number of pixels
to be 1/16th the size of the original image:

$$S(c, s) = |S(c) - S(s)| \tag{1}$$

$$I(c, s) = |I(c) - I(s)| \tag{2}$$

$$E(c, s) = E(c) - E(s). \tag{3}$$
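
A minimal sketch of the center-surround step follows, under stated assumptions. The pyramid used here is a stand-in built by 2 x 2 block averaging rather than Gaussian filtering, "1/16th the size" is read as 1/16 of the original width and height (pyramid level 4), and the function names are hypothetical. With c in {3, 4} and s = c + 3 or c + 4 restricted to s < 8, exactly three maps result: (3, 6), (3, 7) and (4, 7).

```python
import numpy as np

def resize_bilinear(img, shape):
    """Bilinear interpolation of `img` onto a grid of the given shape."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, shape[0])
    xs = np.linspace(0, w - 1, shape[1])
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def toy_pyramid(img, levels=8):
    """Stand-in dyadic pyramid (2 x 2 block averaging), for illustration only."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape
        pyr.append(pyr[-1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def center_surround_maps(pyr, absolute=True):
    """Feature maps for c in {3, 4}, s = c + delta, delta in {3, 4}, s < 8."""
    maps = {}
    for c in (3, 4):
        for delta in (3, 4):
            s = c + delta
            if s >= 8:
                continue
            # Interpolate the coarse (surround) level to the center level, then subtract.
            diff = pyr[c] - resize_bilinear(pyr[s], pyr[c].shape)
            if absolute:  # saturation and intensity streams; edges keep the sign
                diff = np.abs(diff)
            if c == 3:  # assumed decimation: drop every second pixel to reach level-4 size
                diff = diff[::2, ::2]
            maps[(c, s)] = diff
    return maps

# Usage on a synthetic 256 x 256 image (random values, for illustration only).
image = np.random.default_rng(1).random((256, 256))
feature_maps = center_surround_maps(toy_pyramid(image))
print(sorted(feature_maps))
```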
A linear summation across these feature maps for the
different information streams forms the conspicuity maps
($S_c$ for color saturation, $I_c$ for intensity and $E_c$ for edge
information):

$$S_c = \sum_{c=3}^{4} \; \sum_{s=c+3}^{c+4;\, s<8} S(c, s) \tag{4}$$

$$I_c = \sum_{c=3}^{4} \; \sum_{s=c+3}^{c+4;\, s<8} I(c, s) \tag{5}$$

$$E_c = \sum_{c=3}^{4} \; \sum_{s=c+3}^{c+4;\, s<8} E(c, s) \tag{6}$$

The conspicuity maps undergo normalization [11-15],
referred to by the operator $\mathcal{N}$ in the equations. Normalization
is an iterative process that promotes maps with a small number
of peaks with strong activity and suppresses maps with many
peaks of similar activity. Each conspicuity map is first
normalized to a fixed range between 0 and 1. Thereafter, a
two-dimensional difference of Gaussians (DoG) filter is convolved
with the map iteratively. The output is summed with the
original map and negative values are set to zero. The DoG
filter results in the excitation of each pixel with inhibition
from neighboring pixels. The DoG filter function is calculated
as stated below:

$$\mathrm{DoG}(x, y) = \frac{0.5^2}{2\pi\sigma_{ex}^2}\, e^{-(x^2+y^2)/(2\sigma_{ex}^2)} - \frac{1.5^2}{2\pi\sigma_{inh}^2}\, e^{-(x^2+y^2)/(2\sigma_{inh}^2)} \tag{7}$$

where $\sigma_{ex}$ = 2% and $\sigma_{inh}$ = 25% of the input image width.
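
The iterative normalization can be sketched as below. Several details are assumptions rather than statements from the text: the DoG is taken to be an excitatory Gaussian with gain 0.5² minus an inhibitory Gaussian with gain 1.5² (the form used in the earlier Itti and Koch normalization scheme), the Gaussians are truncated at three standard deviations, borders are zero-padded, and the same σ (a fraction of the image width) is used in both directions. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Sampled 1-D Gaussian, truncated at 3 sigma (assumed truncation)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def conv_same(img, k):
    """Separable 2-D convolution with zero padding, output the same size as `img`."""
    r = len(k) // 2
    def one(v):
        return np.convolve(v, k, mode="full")[r:r + len(v)]
    rows = np.apply_along_axis(one, 1, img)
    return np.apply_along_axis(one, 0, rows)

def normalize_map(m, iterations=3, c_ex=0.5, c_inh=1.5, s_ex=0.02, s_inh=0.25):
    """Rescale to [0, 1], then repeat: m <- max(m + m * DoG, 0)."""
    m = m.astype(float)
    span = m.max() - m.min()
    m = (m - m.min()) / span if span > 0 else np.zeros_like(m)
    width = m.shape[1]
    k_ex = gaussian_kernel_1d(s_ex * width)
    k_inh = gaussian_kernel_1d(s_inh * width)
    for _ in range(iterations):
        # Convolving with the DoG equals the difference of two Gaussian convolutions.
        dog_response = c_ex**2 * conv_same(m, k_ex) - c_inh**2 * conv_same(m, k_inh)
        m = np.maximum(m + dog_response, 0.0)
    return m

# One isolated peak versus a regular grid of 64 equal peaks (synthetic maps).
single = np.zeros((64, 64))
single[32, 32] = 1.0
many = np.zeros((64, 64))
many[4::8, 4::8] = 1.0

single_out = normalize_map(single)
many_out = normalize_map(many)
print(single_out[32, 32], many_out[28, 28])
```

Under these assumptions the isolated peak is amplified, while a peak surrounded by many peaks of equal strength is attenuated, which is the qualitative behavior the normalization is meant to produce.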

Intensity  and  saturation  conspicuity  maps  use  three 
normalization  iterations,  and  edge  conspicuity  maps  use 
one  normalization  iteration.  The  number  of  iterations  for 
normalization  is  chosen  based  on  the  computational  load  and 
pilot  studies  that  examined  different  iterations  of  normalization 
and  their  effects  on  the  maps.  The  three  normalized 
conspicuity  maps  are  linearly  summed  and  their  average  forms 
the  final  salience  map  which  again  undergoes  a  three-iteration 
normalization.  The  region  around  the  pixel  with  the  highest 
grayscale  value  in  the  final  salience  map  signifies  the  most 
salient  region: 

$$\text{Salience map} = \frac{\mathcal{N}(S_c) + \mathcal{N}(I_c) + \mathcal{N}(E_c)}{3}$$

There  are  a  few  key  differences  between  the  two  models 
which  make  the  new  model  less  computationally  intensive 
compared  to  the  full  saliency  model.  The  new  model  uses 
only  3  information  streams  for  processing  (versus  7  in  the  full 
model),  4  scales  of  Gaussian  pyramids  (versus  6),  18  feature 
maps  (versus  42).  Instead  of  using  the  two  color  opponent 
streams  as  found  in  the  primate  retina,  the  new  model  uses 
color  saturation.  Color  saturation  information  will  indicate 
purer  hues  with  higher  grayscale  values  and  impure  hues  with 
lower  grayscale  values.  One  stream  of  edge  information  is 
used  instead  of  four  different  orientation  streams.  For  creating 
feature  maps,  the  new  model  focuses  on  the  coarser  scales 
for  center  and  surround  which  represent  low  spatial  frequency 
information  in  the  image. 


2.2.  DSP  implementation 

The  retinal  prosthesis  image  processing  system  is  designed  to 
be  a  portable  module  worn  on  the  body.  For  this  reason, 
the  module  should  be  lightweight  and  compact  with  low 
power  requirements  so  that  it  can  operate  for  several  hours 
on a small battery. Low power requirements may restrict the
amount  of  computation  that  can  be  carried  out  by  the  image 
processing  unit.  With  this  in  mind,  for  research  purposes, 
the  algorithm  has  been  implemented  on  a  Texas  Instruments 
Imaging  Developers  Kit  (IDK)  TMS320  DM642  [24].  The 
DM642  chip  is  a  720  MHz  fixed-point  processor  and  the 
IDK  is  specifically  designed  to  aid  the  development  of  image 
processing  algorithms.  The  IDK  is  not  a  portable  board  (it 
includes  many  functions)  but  in  general,  DSPs  are  designed 
for  low-power,  portable  applications,  so  the  technological  path 
is  clear. 

As  a  first  step  toward  analyzing  the  computational  speed 
of  the  new  model,  we  implement  the  new  model  and  full 
model  (partial)  on  the  DSP-IDK.  The  algorithms  are  modeled 
in Simulink from MathWorks Inc., and then ported to the
DSP.  For  efficiency,  filtering  has  been  implemented  using 
separable  one-dimensional  filters  for  both  the  models.  Only 
the  intensity  stream  of  the  full  model  is  implemented  (one  of 
the seven streams of the full model). Fixed-point hardware can
represent numbers in both fixed-point and floating-point
precision. Both algorithm implementations are in
single-precision floating-point format and are not optimized.
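
The efficiency gain from separable filtering can be illustrated with a short check. When a 2-D kernel is the outer product of a 1-D kernel with itself, filtering the rows and then the columns gives the same result as the full 2-D convolution, at 2k rather than k² multiplications per output pixel for a k-tap kernel. The 5-tap binomial kernel and the function names are assumptions chosen for illustration; this is not the DSP implementation.

```python
import numpy as np

def conv2d_valid(img, kernel2d):
    """Direct 2-D convolution, keeping only fully overlapping positions."""
    kh, kw = kernel2d.shape
    flipped = kernel2d[::-1, ::-1]
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * flipped)
    return out

def conv_separable_valid(img, k1d):
    """Same filter applied as a row pass followed by a column pass."""
    rows = np.apply_along_axis(lambda v: np.convolve(v, k1d, mode="valid"), 1, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k1d, mode="valid"), 0, rows)

k = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0   # assumed 1-D low-pass kernel
img = np.random.default_rng(2).random((12, 10))      # synthetic test image

full_2d = conv2d_valid(img, np.outer(k, k))
separable = conv_separable_valid(img, k)
print(np.allclose(full_2d, separable))

mults_2d = len(k) ** 2       # 25 multiplications per output pixel
mults_separable = 2 * len(k)  # 10 multiplications per output pixel
```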

2.3.  Model  validation  using  gaze  data 

Gaze  experiments  were  carried  out  with  five  human  subjects 
after  the  approval  of  the  Institutional  Review  Board  at  the 
University  of  Southern  California.  A  signed  informed  consent 
was obtained from each participant of the study. Subjects were
required to have English speaking and reading knowledge,
be 18+ years of age, have no history of vertigo, motion
sickness or claustrophobia, have no cognitive or language/hearing
impairments, and have a visual acuity of 20/40 or better with
normal or corrected vision (with lenses). Visual acuity testing
was  carried  out  in  the  lab  using  a  Snellen  visual  acuity  eye 
chart. 

Gaze  data  were  acquired  using  an  eye  tracking  system 
from  Arrington  Research,  Inc.,  Scottsdale,  AZ.  This  system 
consists  of  a  Z800  3D  Visor  Head  Mounted  Display  (HMD) 
with a diagonal field of view of 40°. Images on the HMD are
displayed at a resolution of 800 x 600 pixels. The Viewpoint
eye tracking software from Arrington Research recorded data
at a frequency of 60 Hz using pupil tracking. Subjects were
seated at a table with their head resting on a chin rest. A 12-point
rectangular grid calibration process was used. Subjects were
asked  to  look  at  the  center  of  12  different  squares  that  would 
successively  appear  on  the  HMD  screen.  After  this,  as  a 
measure  of  calibration,  a  test  image  consisting  of  a  circle  in 
the  center  of  the  screen  was  shown  and  subjects  were  asked  to 
look  at  the  center  of  the  circle.  Recording  was  not  done  until 
good  calibration  was  obtained.  Good  calibration  is  defined  as 
a  rectangular  grid  mapped  from  the  gaze  points  of  the  subjects 
when looking at the 12 squares. A set of 150 natural images
was displayed on the HMD with each image being shown for
3  s.  The  images  consisted  of  outdoor  and  indoor  environments. 
Subjects  were  instructed  to  freely  gaze  at  the  image.  To  avoid 
biasing  the  subjects,  no  other  instructions  were  given.  Between 
images,  the  test  circle  image  was  displayed  for  subjects  to  rest 
their  eyes.  However,  after  every  three  images,  subjects  were 
instructed  to  look  at  the  center  of  the  circle  and  the  calibration 
at  this  point  was  noted.  This  helped  to  keep  track  of  any 
calibration  drift  during  the  experiments.  In  post-processing, 
the  recorded  gaze  point  data  were  then  corrected  by  any  offset 
to  get  the  new  gaze  point  data,  corrected  for  calibration  drift. 

The  collected  gaze  data  were  filtered  for  fixations 
and  saccades  using  custom  fixation  and  saccade  filtering 
software  freely  available  on  http://ilab.usc.edu  as  part  of  the 
Neuromorphic  Vision  Toolkit.  Data  were  analyzed  using  gaze 
fixation  points  from  the  data  set.  Fixation  data  points  may  not 
account  for  drifts  in  eye  movements.  However,  by  taking 
a  circular  aperture  around  each  fixation  point  during  data 
analysis,  effects  of  drifts  as  well  as  slight  calibration  offsets 
can  be  avoided. 

Analysis  of  gaze  data  was  carried  out  using  methods  used 
by  Itti  [25]  and  Peters  et  al  [26]  to  analyze  the  contribution 
of  bottom-up  saliency  to  human  eye  movements.  For  each 
image,  gaze  data  points  from  all  subjects  were  pooled  together 
for  analysis.  For  the  same  set  of  150  input  images,  salience 
maps  created  by  the  full  model  were  used  for  a  comparison  of 
results  with  salience  maps  created  by  the  new  model. 

2.3.1.  Analysis  with  the  ratio  of  medians  method  [25]. 
is  defined  as  the  highest  value  of  saliency  within  a  circular 
aperture  of  diameter  5.6°  centered  at  the  fixation  point.  High 
values  of  indicate  that  the  human  observer  fixated  at  a  highly 

salient  region. 

Sr  is  defined  as  the  highest  value  of  saliency  within  a  5.6° 
circular  aperture,  centered  at  a  random  point  chosen  from  a 
uniform  distribution. 

5'max  is  defined  as  the  maximum  value  of  saliency  in  the 
salience  map  of  the  image. 

Each image has between approximately 20 and 40
gaze points after combining gaze data from all subjects. The
same number of points are randomly chosen from a uniform
distribution to calculate S_r. To get a more accurate estimate
for S_r, 100 sets of random points are used for each image, each
generating an S_r value. The median S_rm of this set of values
for each image is used for further analysis. Ratios S_h/S_max
and S_rm/S_max and the medians for each of these are calculated.
The ratio of these medians is then calculated. Higher ratios
mean that saliency values around fixation points are greater
than saliency values around random points, showing that
the model can predict human gaze locations in the image better
than expected by chance.
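The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the aperture radius is given in pixels rather than degrees, and we assume that the per-image S_h and each per-set S_r are taken as the median over the points involved (the text does not state how the points within one image are aggregated).

```python
import random
from statistics import median


def aperture_max(sal, cx, cy, radius):
    """Highest saliency value inside a circular aperture centered at (cx, cy)."""
    h, w = len(sal), len(sal[0])
    best = sal[cy][cx]
    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                best = max(best, sal[y][x])
    return best


def image_ratios(sal, fixations, radius, n_sets=100, seed=0):
    """Return (S_h / S_max, S_rm / S_max) for one image.

    Assumption: S_h and each S_r are medians over the points of the image.
    """
    rng = random.Random(seed)
    h, w = len(sal), len(sal[0])
    s_max = max(max(row) for row in sal)
    s_h = median([aperture_max(sal, x, y, radius) for x, y in fixations])
    s_r_sets = []
    for _ in range(n_sets):
        # Same number of uniformly distributed random points as gaze points.
        pts = [(rng.randrange(w), rng.randrange(h)) for _ in fixations]
        s_r_sets.append(median([aperture_max(sal, x, y, radius) for x, y in pts]))
    s_rm = median(s_r_sets)
    return s_h / s_max, s_rm / s_max


def ratio_of_medians(per_image):
    """per_image: list of (S_h / S_max, S_rm / S_max) pairs, one per image."""
    return median([r[0] for r in per_image]) / median([r[1] for r in per_image])
```

A ratio above 1 indicates that saliency around gaze points exceeds saliency around random points.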

Image shuffling. Shuffling is a control analysis where instead
of using gaze points for the image under consideration, gaze points
of another randomly chosen image are used. The ratio of
medians analysis stated above is done using the saliency maps
from one image and gaze data from the randomly chosen
image. These results are then compared to the results when
using the saliency maps and gaze data for the same image.
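As an illustration of the shuffling control, the pairing of each salience map with gaze data from a different image might be generated as follows. The function name and the seeding are assumptions made for this sketch; the source does not describe how the random image was drawn.

```python
import random


def shuffle_assignment(image_ids, seed=0):
    """Map each image id to a different, randomly chosen image id.

    The gaze points of the mapped image are then scored against the
    salience map of the key image. Requires at least two images.
    """
    rng = random.Random(seed)
    mapping = {}
    for img in image_ids:
        others = [other for other in image_ids if other != img]
        mapping[img] = rng.choice(others)
    return mapping
```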

Differences between S_h and S_rm were evaluated using a
statistical sign test with a significance level of 0.0001 for
both the full and new models for the cases with and without
shuffling. Also, the same statistical test was carried out
between the S_h values with and without shuffling to see if
the S_h values with shuffling are significantly less than the
S_h values without shuffling.
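For reference, a sign test on paired values reduces to a binomial tail probability. The sketch below is a generic one-sided version with ties dropped; whether the original analysis was one-sided or two-sided is not stated, so that choice is an assumption here.

```python
from math import comb


def sign_test_p(a, b):
    """One-sided sign test p-value for the alternative 'a tends to exceed b'.

    Ties are dropped. Under the null hypothesis the number of positive
    differences follows Binomial(n, 0.5).
    """
    pos = sum(1 for x, y in zip(a, b) if x > y)
    neg = sum(1 for x, y in zip(a, b) if x < y)
    n = pos + neg
    if n == 0:
        return 1.0
    # P(X >= pos) for X ~ Binomial(n, 0.5)
    return sum(comb(n, k) for k in range(pos, n + 1)) / 2 ** n
```

With paired per-image values, a result below 0.0001 would meet the significance level used in the text.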

2.3.2.  Analysis  using  normalized  scanpath  salience  (NSS) 
[26].  This  method  normalizes  the  salience  map  to  have 
a  zero  mean  and  unit  standard  deviation.  For  each  point 
corresponding  to  the  fixation  locations,  the  normalized 
salience  value  is  extracted  and  the  mean  of  all  these  extracted 
values  is  calculated.  This  mean  is  the  normalized  scanpath 
salience  (NSS)  value.  If  the  NSS  value  is  greater  than 
zero,  there  is  a  greater  correspondence  between  the  salience 
maps  and  gaze  fixation  points  than  expected  by  chance. 
The  NSS  value  of  zero  would  mean  there  is  no  such 
correspondence  and  a  value  of  less  than  zero  would  mean 
there  is  anti-correspondence  between  the  salience  maps  and 
human  fixations.  To  verify  this  in  practice,  chance  values  are 
calculated  by  creating  a  map  with  a  uniform  distribution  at 
the  same  resolution  as  the  saliency  map  instead  of  the  actual 
salience  map  and  calculating  NSS  in  the  same  manner  as  stated 
above.  We  calculated  the  NSS  values  for  all  gaze  data  points 
by  taking  a  region  of  diameter  5.6°  around  each  fixation  point 
in  order  to  avoid  any  fixation  drifts  and  minor  calibration  offset 
effects.  Here  again,  random  map  generation  was  carried  out 
100  times  for  each  image. 
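The core of the NSS computation can be sketched as below. This simplified version reads the normalized value at the fixation pixel only and omits the 5.6° aperture described above; the function name is hypothetical and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def nss(sal, fixations):
    """Normalized scanpath salience for one map and a list of (x, y) fixations."""
    values = [v for row in sal for v in row]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0  # a flat map carries no information about gaze
    # Normalize to zero mean and unit standard deviation, then average
    # the normalized values found at the fixation locations.
    return mean([(sal[y][x] - mu) / sigma for x, y in fixations])
```

Values above zero indicate fixations on above-average saliency, and values below zero indicate the opposite.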

For both the full and new models, NSS values obtained
using salience maps were compared to the NSS values obtained
using random maps (paired t-test with a significance level of
0.0001).

3.  Results 

3.1.  Saliency  maps 

Figure  2  shows  the  salience  maps  generated  by  the  new  and 
full  models  for  the  same  input  image.  Figure  2(a)  shows  an 
example  of  an  input  image,  the  saturation,  intensity  and  edge 
conspicuity  maps  and  the  final  salience  map  created  by  the 
new  model.  Figure  2(b)  shows  the  salience  map  computed  by 
the  full  model  for  the  same  input  image.  On  comparing  the 
final  saliency  maps  from  the  new  and  the  full  models,  we  can 
observe  that  the  salient  image  areas  (e.g.  the  curb,  the  plant, 
etc)  are  similar  in  the  outputs  of  both  models.  The  new  model 
works  with  a  lower  resolution  image  (320  x  240  pixels)  than 
the  full  model  (640  X  480  pixels)  resulting  in  coarser  maps 
when  compared  to  the  full  model. 

Saturation  and  edge  conspicuity  maps  (figure  2(a)) 
enhance  objects  with  more  saturated  hues  and  darker  colors 
and  objects  with  prominent  edges  respectively  whereas  the 
intensity  conspicuity  map  enhances  objects  with  intensity 
contrast  in  the  image  frame.  The  process  of  creating  the  feature 
maps  and  normalizing  them  can  lead  to  certain  pixels  not  being 
enhanced  in  the  final  conspicuity  map. 




J.  Neural  Eng.  7  (2010)  016006 


N  Parikh  et  al 


Figure 2. Salience maps created by the new model and the full model for the same input image: (a) conspicuity maps for saturation,
intensity and edge along with the final salience map created by the new model for an example input image; (b) final salience map created by
the full model for an example input image.


3.2.  DSP  implementation  results 

The various modules of each model and the time required to
process one frame are stated in table 1. As stated earlier, for
the full model, only the intensity stream, which is one of seven
different streams, has been implemented on the DSP.

Execution time results from table 1 show that a single
image frame takes 0.84 s to be processed by the new model,
whereas the intensity stream alone (one of seven streams, or
about 14% of the full model) takes 1.53 s to process the same
image. This shows that the implementation of the new model is
computationally more efficient. The estimated time for the full
model to execute one frame can be calculated by multiplying the
time for the intensity stream execution by a factor of 7. This is
because the intensity stream is one of seven streams of similar
computational complexity in the original saliency algorithm.
This implies that the implementation of the new model is
approximately ten times faster than the implementation of the
full model.
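The estimate can be reproduced directly from the totals in table 1. The snippet below only restates that arithmetic; the assumption that all seven streams cost the same as the intensity stream comes from the text, and the variable names are ours.

```python
NEW_MODEL_S = 0.8416         # entire new model, seconds per frame (table 1)
INTENSITY_STREAM_S = 1.5373  # intensity stream of the full model (table 1)
N_STREAMS = 7                # streams of similar complexity in the full model

# Estimated full-model time per frame and the resulting speedup factor.
full_model_estimate_s = N_STREAMS * INTENSITY_STREAM_S
speedup = full_model_estimate_s / NEW_MODEL_S

print(f"estimated full model: {full_model_estimate_s:.2f} s/frame")
print(f"speedup of new model: {speedup:.1f}x")
```

The estimate comes to roughly 10.8 s per frame and a factor of about 12.8, the same order as the approximately tenfold figure quoted above.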

Optimization  can  lead  to  better  results  as  can 
improvements  in  processor  speed  and  power  consumption. 
However,  in  wearable  computing  systems,  increased  algorithm 
efficiency  will  always  translate  into  lower  power  consumption. 
Even  the  unoptimized  implementation  of  the  new  model  can 
execute  in  less  than  1  s,  which  is  a  reasonable  response  time 
to  a  user  request  for  information. 

3.3.  Model  validation  using  gaze  data 

3.3.1. Analysis with the ratio of medians method [25]. Figure 3
shows two examples of input images with salience maps from
the new model and the full saliency model. Points in images


Figure  3.  Input  image  (a  and  f),  salience  maps  from  new  model  (b  and  g)  and  full  saliency  model  (c  and  h)  with  the  dots  depicting  gaze 
fixation  points,  and  salience  maps  from  the  new  model  (d  and  i)  and  full  saliency  model  (e  and  j)  with  dots  depicting  data  points  after 
shuffling. 


Table  1.  Time  in  seconds  for  computation  of  different  modules  in  the  new  model  and  intensity  stream  of  the  full  model  on  the  TMS320 
DM642  DSP. 


Functions                          New model          Intensity stream from the
                                   implementation     full model implementation
                                   (time in seconds)  (time in seconds)

YCbCr → RGB                        0.1250             -
RGB → HSI                          0.0320             -
Gaussian pyramids (intensity and   0.0647             0.0695
saturation for new model)
Laplacian pyramids                 0.0303             -
Center-surround maps               0.0027             0.0158
Normalization function             -                  0.6062
(at different scales)              -                  0.0757
                                   0.0096             0.0113
Entire algorithm (s)               0.8416             1.5373
Frames/second                      1.1882             0.6505
Figure 4. Gaze distribution of all subjects over all images (a) and
the average salience map from the salience maps of all images (b) in
the entire data set.


Table  2.  Data  analysis  of  human  gaze  data  with  salience  maps 
created  by  the  new  model  and  the  full  saliency  model  using  the  ratio 
of  medians  method. 


                   Median         Median          Ratio of   Sign test
                   (S_h/S_max)    (S_rm/S_max)    medians    (S_h and S_rm)

New model          0.3647         0.1020          3.5769     p < 0.0001
Full model         0.4352         0.2457          1.7714     p < 0.0001

Image shuffling
New model          0.2275         0.1059          2.1481     p < 0.0001
Full model         0.3256         0.2511          1.2970     p < 0.0001

(b),  (c),  (g)  and  (h)  depict  gaze  fixation  points  from  human 
subject  data  and  the  points  in  images  (d),  (e),  (i)  and  (j)  depict 
data  points  obtained  by  shuffling  (gaze  points  from  another 
image). 

Table 2 shows analysis of the full and the new models for
the actual gaze data and randomly distributed gaze points.
Both the full and the new models have ratios which are
significantly above chance (sign test with a significance level
of 0.0001 carried out between S_h and S_rm; chance = 1),
indicating that both models predict better than chance where
human observers will look. The ratio of medians calculated
for the new model is higher than that for the full model, which
shows that the new model outperforms the full model in this case.
The maps from the full model are slightly denser than the
maps from the new model, as seen in the comparison between
figures 3(i) and (j). This results in the overall median values
of the full model being greater than those of the new model.

The shuffled image analysis results are also shown in
table 2. A statistical sign test with a significance level of
0.0001 between the S_h values with and without shuffling
indicates that the S_h values without shuffling are significantly
higher than the S_h values with shuffling. The shuffled analysis
shows  that  the  median  values  and  ratios  are  lower  than  when 
the  saliency  maps  and  gaze  data  correspond,  but  the  ratios  are 
statistically  greater  than  one,  that  is,  better  than  chance.  This 
discrepancy  can  be  explained  by  the  center-bias  effect  present 
in  the  average  salience  map  of  all  images  as  well  as  in  the  gaze 
data  of  subjects.  When  looking  at  unfamiliar  images,  subjects 
often  start  looking  at  the  center  and  then  proceed  to  examine 
the  peripheral  areas.  Subjects  are  asked  to  look  at  the  center  of 
a  test  image  after  every  three  images  for  calibration  purposes  as 
mentioned  before  which  could  also  add  to  their  initial  fixation 
being  centrally  biased.  Finally,  due  to  potential  photographer 
bias  (having  interesting  objects  in  the  center  of  the  image),  the 


Table  3.  Data  analysis  of  human  gaze  data  with  salience  maps 
created  by  the  new  model  and  the  full  saliency  model  using  the  ratio 
of  medians  method  for  the  image  data  set  after  removing  images 
with  a  center  bias  in  the  gaze  data  and/or  the  salience  maps. 


                   Median         Median          Ratio of   Sign test
                   (S_h/S_max)    (S_rm/S_max)    medians    (S_h and S_rm)

New model          0.3490         0.1020          3.4231     p < 0.0001
Full model         0.4278         0.2519          1.6985     p < 0.0001

average  of  all  the  salience  maps  in  the  input  image  data  set 
also  has  a  center  bias.  Figure  4  shows  the  center-bias  in  the 
gaze  data  as  well  as  the  average  salience  map. 

To  investigate  the  finding  that  saliency  and  gaze  were 
correlated  even  with  image  shuffling,  a  center-bias  analysis  of 
the  gaze  points  from  subjects  as  well  as  the  average  salience 
map  was  done.  Based  on  Tatler’s  analysis  [27],  for  each  image, 
the  number  of  gaze  points  falling  into  the  central  15°  was 
counted  and  compared  to  the  number  of  gaze  points  falling  into 
the  rest  of  the  image  areas  which  are  referred  to  as  peripheral 
areas.  If  the  number  of  gaze  points  in  the  central  region  was 
greater  than  the  number  in  the  peripheral  regions,  there  was  a 
center  bias  in  gaze  data.  Calculations  show  that  26%  of  the 
images  used  in  our  study  have  a  center  bias  in  the  subject  gaze 
data.  Similarly,  the  number  of  pixels  whose  grayscale  level  is 
the  maximum  value  of  the  average  salience  map  is  calculated 
in  the  central  and  peripheral  areas  of  the  average  salience  map. 
If  the  number  of  such  maximum  grayscale  valued  pixels  is 
greater  in  the  center  than  in  the  periphery,  there  is  said  to  be  a 
center  bias.  Figure  4(b)  shows  the  average  saliency  data  from 
all  images,  indicating  central  bias.  The  bias  in  the  subject  gaze 
while  viewing  these  unseen  natural  images  and  the  bias  in  the 
salience  maps  due  to  the  photographer  bias  may  be  a  reason 
behind  the  ratio  of  medians  being  greater  than  1  even  with 
image  shuffling.  In  general,  if  the  combination  of  the  shuffled 
gaze  data  set  and  the  salience  map  is  such  that  both  have  an 
overlap,  the  ratio  for  such  combinations  will  be  greater  than  1. 
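The per-image center-bias criterion for gaze data can be sketched as a simple count. In this illustration the central region is modeled as a disc whose radius is given in pixels; the function name, the disc shape and the pixel units are assumptions, since the text specifies the region only as the central 15°.

```python
def has_center_bias(points, width, height, central_radius):
    """True if more gaze points fall in the central disc than outside it."""
    cx, cy = (width - 1) / 2, (height - 1) / 2
    central = sum(
        1 for x, y in points
        if (x - cx) ** 2 + (y - cy) ** 2 <= central_radius ** 2
    )
    peripheral = len(points) - central
    return central > peripheral
```

Applying the same count to the maximum-valued pixels of the average salience map gives the corresponding test for the maps.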

Analysis was repeated after removing images with a
central bias. Table 3 shows these results. The median values
for S_h/S_max and S_rm/S_max are very close to those obtained
with the entire set of images including the ones with a center
bias. As before, a sign test with a significance level of 0.0001
carried out between S_h and S_rm shows that S_h values are
significantly higher than S_rm values.

3.3.2.  Analysis  using  normalized  scanpath  salience  (NSS) 
[26].  Figure  5  shows  two  examples  of  an  input  image  with 
salience  maps  from  the  new  model  and  the  full  saliency  model. 
The  figure  also  shows  a  random  map  created  from  a  uniform 
distribution  at  the  same  resolution  as  the  salience  maps  for  the 
new  and  full  models.  The  dots  in  the  salience  and  random  maps 
represent  the  gaze  fixation  points  of  human  observers. 

The  NSS  results  are  shown  in  table  4  for  the  new  model 
as  well  as  the  full  model.  NSS  values  for  both  models  with 
salience  maps  are  greater  than  zero  whereas  the  NSS  value 
for  the  random  model  is  close  to  zero.  The  results  for  the 
analysis  on  the  data  set  of  images  after  removing  images 
with  a  gaze  or  salience  map  center  bias  are  shown  in  table  5. 
Figure 5. Input image (a and f), salience maps from the new model (b and g) and the full saliency model (c and h); uniform distribution
random map with the dots depicting the gaze fixation points of the human subjects for the new model (d and i) and for the full model
(e and j).

Table  4.  Data  analysis  of  human  gaze  data  with  salience  maps  created  by  the  new  model  and  the  full  model  using  the  normalized  scanpath 
salience  method. 


             For salience map    For random map                 Paired
             NSS ± SEM           NSS ± SEM                      t-test

New model    0.4310 ± 0.0113     -4.813 x 10^-[?] ± 0.0005      p < 0.0001
Full model   0.4758 ± 0.0098     -5.077 x 10^-[?] ± 0.0005      p < 0.0001


Table  5.  Data  analysis  of  human  gaze  data  with  salience  maps  created  by  the  new  model  and  the  full  model  using  the  normalized  scanpath 
salience  method  for  the  image  data  set  after  removing  images  with  a  center  bias  in  the  gaze  data  and/or  the  salience  maps. 


             For salience map    For random map                 Paired
             NSS ± SEM           NSS ± SEM                      t-test

New model    0.4153 ± 0.0104     1.2311 x 10^-[?] ± 0.0006      p < 0.0001
Full model   0.4746 ± 0.0093     1.9836 x 10^-[?] ± 0.0005      p < 0.0001

For both cases, a paired t-test with a significance level of
0.0001 shows that the NSS values obtained using salience
maps are significantly different from the NSS values obtained
using random maps, meaning there is greater correspondence
between salient regions detected by the salience maps and
human fixations than expected by chance.


4.  Discussion 

We present a computationally efficient model of bottom-up
saliency detection based upon an earlier saliency model
[11-15]. Good correspondence is noted when comparing
regions that the algorithm predicts as salient to regions gazed
at by human subjects when looking at a set of images. Also,
comparing the salient regions to random gaze points shows
that the model predicts salient regions at a rate better than what
would be expected by chance. We have validated our algorithm
with sighted observers viewing images on a computer screen
while seated. The subjects are shown unfamiliar scenes to
limit bias in gaze patterns. Using a set of 150 images
and gaze data from five subjects, results show comparable
performance between the new and the full models. An
unoptimized implementation on the TMS320 DM642 DSP
shows that the proposed model can process at a rate of 1 frame
per second, which is approximately ten times faster than the
full model. Since this algorithm is ultimately intended to
run on a wearable computing platform, efficiency is a critical
factor.

The  model  is  proposed  as  the  core  of  an  image  processing 
algorithm  designed  to  provide  visual  prosthesis  patients  with 
information  about  the  areas  outside  the  visual  field  of  the 
implant.  Such  an  algorithm  could  be  utilized  in  a  number 
of  ways.  During  navigation  and  ambulation,  the  user  might 
want  to  know  about  obstacles  or  signs  (for  example,  an 
exit  sign).  Other  times,  the  user  might  be  searching  for  an 
object  of  interest.  While  it  is  possible  to  design  specific 
algorithms  tailored  to  each  task,  there  are  advantages  to  a 
bottom-up  approach.  Unlike  top-down  algorithms  that  require 
a  priori  information,  bottom-up  algorithms  do  not  require  any 
training.  Also,  a  bottom-up  algorithm  may  allow  the  user  to 
identify  objects  and  understand  surroundings  using  remnant 
vision and contextual cues. Nevertheless, it is possible that a
top-down algorithm may be needed for specific tasks such
as object recognition, particularly where vision is very
poor. Frintrop et al proposed a saliency implementation
based on ten information streams and a five level image
pyramid scheme for robotics [20]. Walther et al proposed
a bottom-up implementation based on the full model with an
added feedback module to detect the extent of an attended
object [28]. Both groups augmented their bottom-up
implementations with top-down information, applying
feature detection and/or object recognition to the salient
regions [29, 30].

To  be  effective  for  a  retinal  prosthesis  implant  patient, 
more  work  is  required  to  understand  what  functions  are 
important  for  these  patients.  Training  will  also  be  required 
to  best  utilize  information  provided  by  the  algorithm.  Also, 
it  is  unclear  if  patients  can  learn  to  take  advantage  of 
the  additional  information  or  if  they  will  prefer  to  receive 
unfiltered  video  data  and  make  their  own  judgments  about 
object  importance.  Task-dependent  processing  may  be  the 
best  approach.  For  obstacle  avoidance  and  route  planning, 
visually  impaired  people  are  likely  to  be  more  interested  in 
large  objects  obstructing  their  path  versus  small  details  of 
the  environment  around  them.  In  such  a  case,  a  saliency 
algorithm  may  be  adequate.  For  an  object  detection  task, 
smaller  details  may  be  important  discriminating  clues  to  aid 
successful  task  completion  and  a  top-down  object  recognition 
algorithm  may  be  required.  Any  additional  algorithm  will 
require  more  computing  power  so  additional  benefit  will  come 
at  a  cost.  The  extra  computing  load  can  be  limited  by  applying 


object  recognition  only  in  the  small  region  identified  as  salient. 
This  would  also  eliminate  the  need  for  the  entire  image  frame 
to  be  processed  in  smaller  parts  by  the  object  recognition 
algorithm  to  find  the  various  objects. 

In  summary,  a  computationally  efficient  image  processing 
algorithm  has  been  analyzed  and  identifies  parts  of  an  image 
that  human  observers  also  deem  salient.  This  algorithm 
has  the  potential  to  enhance  low  vision,  particularly  when 
visual  field  is  restricted.  When  used  with  a  retinal  prosthesis, 
the  algorithm  can  be  implemented  on  the  retinal  prosthesis’ 
existing  camera  and  wearable  computing  platform.  Critical 
questions  remaining  to  be  answered  by  human  testing  include 
how  quickly  people  learn  to  utilize  the  algorithm  and  the 
benefit  provided. 
With the recent proliferation of robust but computationally demanding robotic algorithms, there is now a need
for  a  mobile  robot  platform  equipped  with  powerful  computing  facilities.  In  this  paper,  we  present  the  design 
and  implementation  of  Beobot  2.0,  an  affordable  research-level  mobile  robot  equipped  with  a  cluster  of  16  2.2- 
GHz  processing  cores.  Beobot  2.0  uses  compact  Computer  on  Module  (COM)  processors  with  modest  power 
requirements,  thus  accommodating  various  robot  design  constraints  while  still  satisfying  the  requirement  for 
computationally  intensive  algorithms.  We  discuss  issues  involved  in  utilizing  multiple  COM  Express  modules 
on  a  mobile  platform,  such  as  interprocessor  communication,  power  consumption,  cooling,  and  protection 
from  shocks,  vibrations,  and  other  environmental  hazards  such  as  dust  and  moisture.  We  have  applied  Beobot 
2.0  to  the  following  computationally  demanding  tasks:  laser-based  robot  navigation,  scale-invariant  feature 
transform  (SIFT)  object  recognition,  finding  objects  in  a  cluttered  scene  using  visual  saliency,  and  vision-based 
localization,  wherein  the  robot  has  to  identify  landmarks  from  a  large  database  of  images  in  a  timely  manner. 
For  the  last  task,  we  tested  the  localization  system  in  three  large-scale  outdoor  environments,  which  provide 
3,583,  6,006,  and  8,823  test  frames,  respectively.  The  localization  errors  for  the  three  environments  were  1.26, 
2.38,  and  4.08  m,  respectively.  The  per-frame  processing  times  were  421.45,  794.31,  and  884.74  ms  respectively, 
representing  speedup  factors  of  2.80,  3.00,  and  3.58  when  compared  to  a  single  dual-core  computer  performing 
localization.  ©  2010  Wiley  Periodicals,  Inc. 


1.  INTRODUCTION 

In the past decade, researchers in the field of mobile
robotics have increasingly embraced probabilistic approaches
to solving hard problems such as localization
(Fox, Burgard, Dellaert, & Thrun, 1999; Thrun, Fox, &
Burgard, 1998; Thrun, Fox, Burgard, & Dellaert, 2000), vision
(Heitz, Gould, Saxena, & Koller, 2008; Wu & Nevatia,
2007), and multirobot cooperation (Fox, Burgard, Kruppa,
& Thrun, 2000; Thrun & Liu, 2003). These algorithms are far
more sophisticated and robust than the previous generation
of techniques (Brooks, 1986; Maes & Brooks, 1990; Pomerleau,
1993) because they can simultaneously consider many
hypotheses in the form of multimodal distributions. For the
same reason, however, they are also far more computationally
demanding. For example, a visual recognition task, in which
we need to compare the current input image captured by a
camera against a large database of sample images (Bay,
Tuytelaars, & Gool, 2006; Lowe, 2004; Mikolajczyk & Schmid,
2005), requires not only that robust visual features be
extracted from the input image (already a computationally
demanding task) but also that these features be matched
against those stored in the database (an even more demanding
task when the database is large). As a point of reference,
comparing two 320 x 240 images using scale-invariant feature
transform (SIFT) features (Lowe, 2004) can take 1-2 s
on a typical 3-GHz single-core machine. To be able to run
such algorithms in near real time, we need a mobile robot
equipped with a computing platform significantly more
powerful than a standard laptop or desktop computer.

However, existing indoor and/or outdoor mobile
robot platforms commercially available to the general research
community still appear to put little emphasis on
computational power. In fact, many robots, such as the Segway
RMP series (Segway, Inc., 2009), have to be separately
furnished with a computer. On the other hand, robots that
come equipped with multiple onboard computers either do
not use the most powerful computers available today [e.g.,
the Seekur (MobileRobots, Inc., 2009), which relies on the
less powerful PC/104 standard] or have fewer computers
(e.g., Carnegie Mellon University Robotics Institute, 2009;
Willow Garage, 2009) than our proposed solution.

Before describing the design and implementation of
our robot, in Section 1.1 we survey the current trends in the
mobile robot market and identify the most desirable features
in an ideal research robot (aside from our central requirement
of powerful computational facilities). Note that
some of the robots discussed below may no longer be available
(or may never have been) to the general public. We include
them nonetheless for completeness of our analysis.

We then describe our main contribution in Section 1.2,
the design and implementation of our proposed platform,
Beobot 2.0, a powerful mobile robot platform equipped
with a cluster of 16 2.2-GHz processing cores. Our robot
uses compact Computer on Module (COM) processors
with modest power requirements, thus accommodating
various robot design constraints while still satisfying the
requirement for computationally intensive algorithms.

Our complete design specifications, including supplier
and cost information for almost all the materials, are freely
available on the Internet (Siagian, Chang, Voorhies, & Itti,
2009). As the manufacturing, assembly, and machining details
are available online, in this paper we focus on (1) the
design decisions we made, the implementational issues we
faced, and how we resolved them, and (2) experimental
testing of the robot in diverse tasks.

1.1. Current Mobile Robot Platforms

In the current state of robotics, researchers utilize a
variety of mobile robots, from the commercially available
(iRobot Corporation, 2009b; MobileRobots, Inc., 2009;
Willow Garage, 2009) to the custom made (Carnegie
Mellon University Robotics Institute, 2009). These robots
are built for many different environments, such as underwater
(iRobot Corporation, 2010; USC Robotics, 2009),
aerial (Finio, Eum, Oland, & Wood, 2009; He, Prentice, &
Roy, 2008), and land (Quigley & Ng, 2007; Salichs, Barber,
Khamis, Malfaz, Gorostiza, et al., 2006). Here we focus on
land robots of a size close to that of an adult human that
can traverse most urban environments, both indoors and
outdoors, and for considerable distances. In addition, such
a robot should be versatile enough for research in many
subfields such as localization/navigation, human-robot
interaction, and multirobot cooperation.

Furthermore, we primarily focus on sites that are similar
to a college campus setting, which is mostly paved with
some rough/uneven roads, not terrains that one would
see in combat zones. Nowadays, because there is a concerted
effort by most governments to make pertinent locations
accessible to the disabled (using wheelchairs), legged
robots [from the small QRIO and AIBO by Sony (Sony
Entertainment Robot Europe, 2009) to the human-sized
Honda Asimo (American Honda Motor Co., Inc., 2009)]
are no longer a must. A wheeled platform would suffice
for the target environments. However, the ability to traverse
reasonably sloped terrain (about 10 deg) should also
be expected. Also, some form of weather protection in the
outdoors is essential. Although the robot is not expected
to operate in all kinds of harsh weather (pouring rain,
for example), like Seekur and Seekur Jr. by MobileRobots
(MobileRobots, Inc., 2009) and iRobot's PackBot (iRobot
Corporation, 2009a), it should nevertheless be able to handle
most reasonable conditions.

An overall size that is close to that of an adult human
is ideal because the robot would be small enough to go
through narrow building corridors and yet large enough
to travel between buildings in a timely manner. We thus
exclude small robots such as the Khepera (AAI Canada,
Inc., 2009) and large robotized cars such as the entries to
the DARPA Grand Challenge. Smaller robots such as the
Roomba (iRobot Corporation, 2009b) and Rovio (Evolution
Robotics, Inc., 2009) and slightly larger ones such as the
Pioneer (MobileRobots Inc., 2009) are also excluded because
of their lower payload capacity, which limits the amount of
computing that can be carried to a single laptop.

Aside from mobility, a few other important features
contribute to the usability of the robot. They are battery life,
sensors, interfaces, and available software. An ideal battery
system would be one that enables the user to run for a
whole day without having to recharge. The two factors that
matter here are the total charge carried and the amount of
charge required to operate the robot. The latter is dictated
by the total weight of the robot and power consumption of
the computers and devices. These requirements should be
decided first. On that basis, the former can then be adjusted
by selecting the proper battery system (type and quantity).
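The sizing logic described above amounts to dividing carried energy by total power draw. The following sketch makes that explicit; it works in watt-hours and watts rather than charge, and every number in the usage note is a hypothetical placeholder, not a Beobot 2.0 specification.

```python
import math


def runtime_hours(capacity_wh, loads_w):
    """Hours of operation for a given battery energy and list of power draws."""
    return capacity_wh / sum(loads_w)


def batteries_needed(target_hours, loads_w, capacity_wh_each):
    """Smallest number of identical batteries that covers the target runtime."""
    return math.ceil(target_hours * sum(loads_w) / capacity_wh_each)
```

For example, with assumed loads of 200 W for computing and 80 W for motors and devices, an 840 Wh pack would last 3 hours, and an 8-hour day would need six 420 Wh batteries.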

There  are  different  types  of  available  batteries:  NiCd, 
NiMH,  sealed  lead  acid  (SLA),  and  lithium  based.  The 
trade-off  is  that  of  cost,  dimensions,  and  durability.  For 
one,  SLA  batteries  are  the  most  economical  (in  terms  of 
cost-to-charge  ratio),  widely  available  when  it  comes  time 
to  replace  them,  and  robust  as  they  are  easy  to  maintain 
and  long  lasting.  However,  SLA  batteries  have  low  charge- 
to-weight  as  well  as  charge-to-volume  ratios  compared 
to, particularly, the lithium-based technologies. Lithium
batteries are lighter and more compact for a comparable
amount of charge (National Institute of Standards and
Technology, 2009). However, these types of batteries are
much more expensive and fragile than the very rugged
SLAs. If lithium batteries are not handled carefully, for
example by not using protective circuits, they can explode.
Although battery weight is not as much of an issue for the
size of robot we are considering, note that volume would
still matter in terms of placement.

It is important to have easy access to the battery
compartment so that we do not have to unscrew or disassemble
components in order to charge the batteries. If the batteries
can be charged rapidly, within an hour or two, an even better
option would be to charge them without removing them
from the robot, using a docking station or a wall outlet
instead. If rapid recharge is not available, a feature to hot
swap the batteries, as in Willow Garage (2009), would be
convenient because it avoids shutting down the computers
during the swap.

To be applicable to a wide range of robotics and vision
research, most commercial robots are furnished with a
variety of sensors and manipulation tools such as a robot
arm and also provide avenues for future expansion. When
selecting a sensor, we look for compact, light, low-power
devices that exhibit high accuracy and high update rates.
Popular sensors such as laser range finders, sonar rings,
cameras, inertial measurement units (IMU), global
positioning systems (GPS), and compasses should be
considered as potential accessories. For a camera in
particular, negligible latency is a must. After a number of
experiments, we found that Firewire (IEEE-1394) cameras
were the best at minimizing delays, more so than Internet
protocol (IP) or universal serial bus (USB) cameras. In
addition, related features such as pan-tilt-zoom, autofocus,
image stabilization, low-light capability, and wide-angle or
omnidirectional viewing setups are also made available by
various companies.

As  for  sensor  expansion,  aside  from  anticipating  the 
future extra payload, it is also important to have many
accessible USB ports placed throughout the body of the robot
and  USB-to-serial  converters  for  serial  devices,  as  well  as 
several  microcontrollers  that  can  preprocess  slower  input 
signals. 

Another important feature is having multiple types
of user interface. For example, USB inputs are useful to
connect a keyboard and monitor to the computers in the
robot to allow for hardware and operating system
reconfiguration. In addition, many robots have reasonably sized
(15-25 cm) full red-green-blue (RGB) liquid crystal display
(LCD) monitors for visualization of the robot's state during
test runs. Furthermore, wireless network connections for
remote secure shell (SSH) logins give us additional
flexibility to allow for safer and faster on-site debugging of
algorithms. At the same time, the robot can use the external
connection to access outside network or Internet resources,
which can be useful in some scenarios. A related feature in
this category is a standard radio frequency (RF) remote
controller for stopping the robot whenever autonomous driving
starts to fail. Furthermore, most robots (Carnegie Mellon
University Robotics Institute, 2009; MobileRobots, Inc., 2009)
are equipped with large kill switches to stop the flow of
power to the motors.

In  addition  to  the  hardware-related  aspects,  robotic 
companies  also  provide  software  libraries  to  conveniently 
access  all  the  included  devices  and  monitor  low-level  states 
such  as  battery  charge  and  temperature.  Some  companies 
(MobileRobots, Inc., 2009) provide further value additions
such as mapping and navigation tools and even a full-blown
simulation environment. We list the common software
offerings in Section 4, where we describe our freely
available toolkit (Itti, 2009).


1.2. Our Approach

Beobot 2.0 is the next iteration of the Beobot system
developed in our lab (Chung, Hirata, Mundhenk, Ng, Peters,
et al., 2002). The original Beobot integrated two full-sized,
dual-CPU motherboards for a total of four 1-GHz
processors. For Beobot 2.0, we use eight dual-core COM systems.
Each COM measures just 125 × 95 × 18 mm and nominally
consumes only 24 W of power. Nonetheless, with
a 2.2-GHz dual-core processor, a COM has computing
power equivalent to that of current dual-core laptop systems.
Despite this state-of-the-art computing platform, we have
managed to keep the overall cost of our research-level,
cluster-based mobile robot under $25,000 (detailed in
Siagian et al., 2009).


One  aspect  of  a  COM  system  to  underscore  here  is  the 
ease  with  which  its  components  can  be  upgraded.  Because 
the  input  and  output  signals  are  routed  through  just  two 
high-density  connectors,  one  need  only  remove  the  current 
module  and  replace  it  with  an  upgraded  one.  Thus,  as  more 
and  more  powerful  processors  become  available,  Beobot 
2.0's computer systems can keep pace, making it somewhat
more resistant to the rapid obsolescence that is characteristic
of computer systems. The ability to keep pace with processor
technology is important because robotic algorithms
are expected to continue to evolve and become ever more
complex, thus requiring commensurate levels of computing
power.

Beobot 2.0's computer system is mounted on an electric
wheelchair base (Figure 1), with an overall size that is
close to that of a human. This allows the robot to navigate
through corridors and sidewalks and creates an embodiment
that is ideal for interacting with people. We assume
that the majority of these locations are wheelchair
accessible, as required by law. We believe that even with
this locomotion limitation, there are still enough physically
reachable locations to perform comprehensive real-world
experiments. Figure 1 shows the finished robot.

The rest of the paper is organized as follows: first, we
describe the electrical system in Section 2 and then the
mechanical system in Section 3. Section 4 goes into the details
of our software library, highlighting the advantage of
implementing a computing cluster in robotics research.

In Section 5 we examine the robot on various important
operational aspects, the most important of which is
computational speed/throughput, to demonstrate how one
could benefit from such a complex computing cluster
architecture. We test Beobot 2.0 using three benchmark
algorithms. The first is the popular SIFT (Lowe, 2004) object
recognition. The second is a distributed saliency algorithm (Itti,
Koch, & Niebur, 1998), which models the visual attention
system of primates. The algorithm operates on a very large
image of 4,000 × 4,000 pixels and returns the most salient
parts of the image. The last one is a vision localization
system by Siagian and Itti (2009) that requires the system to
compare a detected salient landmark input with a large
landmark database obtained from previous visits. All of
these  algorithms  are  part  of  the  Vision  Toolkit,  available 
freely  online  (Itti,  2009),  which  also  houses  Beobot  2.0's 
software  control  architecture,  including  obstacle  avoidance 
(Minguez  &  Montano,  2004)  and  lane  following  (Ackerman 
&  Itti,  2005).  We  then  summarize  our  findings  (in  Section  6) 
and  what  we  have  learned  through  the  process  of  building 
this  robot. 
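
To make the distributed saliency benchmark more concrete, one plausible way to divide a large image among the cluster's cores is to cut it into a grid of tiles. The sketch below is an assumption about how such a partition could look, not the toolkit's actual scheme; `tile_bounds` and the 4 × 4 grid are hypothetical, and a real saliency computation would also need overlap between tiles, which is ignored here.

```python
def tile_bounds(width, height, cols, rows):
    """Split a width x height image into a cols x rows grid of tiles.

    Returns (x0, y0, x1, y1) half-open pixel rectangles in row-major
    order. Together they cover the image exactly once, even when the
    image size is not a multiple of the grid size.
    """
    if cols <= 0 or rows <= 0:
        raise ValueError("cols and rows must be positive")
    xs = [width * c // cols for c in range(cols + 1)]
    ys = [height * r // rows for r in range(rows + 1)]
    return [(xs[c], ys[r], xs[c + 1], ys[r + 1])
            for r in range(rows) for c in range(cols)]


# One tile per core for a 4,000 x 4,000 image on a 16-core cluster.
tiles = tile_bounds(4000, 4000, cols=4, rows=4)
```

Each tile could then be sent to a different node, with the per-tile results merged on a single node afterward.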

2.  ELECTRICAL  SYSTEM  DESIGN 
AND  IMPLEMENTATION 

Figure 2 presents an overview of the electrical system. On
the right-hand side of the figure, there are two baseboards,
each housing four COM Express modules (explained in
depth below) and implementing signals such as gigabit
Ethernet for the backplane intermodule communication, as
well as others such as SATA (two per module), PCI Express,
USB, and VGA. Beobot 2.0 uses a PCI Express 1394-Firewire
card for a low-latency camera connection. One of the SATA
ports is used for the primary hard drive and the other for
external drives such as CD-ROM (useful for installing
operating systems, for example). Giving each module its own
hard drive obviates the need to pass around copies of stored
data, such as large knowledge databases obtained during
training.

Figure 1. Various features of Beobot 2.0. Beobot 2.0 utilizes an electric wheelchair platform to carry a high-performance computing
cluster of 16 processor cores, 2.2 GHz each. The robot is equipped with various sensors such as a Firewire camera, laser range finder,
sonar array, IMU, compass, and GPS. In addition, panel-mount waterproof USB connectors are available for new sensors, along
with RJ45 Ethernet for a wired Internet connection and panel-mount KVM inputs for a regular-sized monitor, keyboard, and mouse.
There is also a touchscreen LCD for a convenient user interface. Furthermore, the kill switches at each corner of the robot are
available as a last resort to stop it in emergency situations.

There are six USB signal implementations per computer
for a total of 48. Some of them are used for the
sensors listed in Table I. Several of the USB connectors
are panel mounted outside the robot, using dust- and
waterproof connectors, for ease of connecting external devices
(Figure 1). In addition, there are also USB connectors inside,
on the baseboards (see Figure 3).

Furthermore, there is an onboard keyboard-video-mouse
(KVM) switch to toggle between each of the eight
computers. The KVM is an eight-to-two switch: eight computers
to two display outputs. The display signal can be routed
either to a regular-sized external monitor or to an onboard
8-in. touchscreen LCD with a full-color video graphics
array (VGA) interface (Figure 1). Note that in practice we
operate all computers from a single node using an SSH login
session to conveniently run and monitor multiple programs
simultaneously. The use of the wired interface to the
individual computers is usually limited to hardware, BIOS,
and boot troubleshooting.

Table I. Sensors provided in Beobot 2.0.

Item                Company name   Reference
Laser range finder  Hokuyo         Hokuyo Automatic Co., Ltd., 2009
IMU                 MicroStrain    MicroStrain, Inc., 2009
Compass             PNI            PNI Sensor Corporation, 2009
Sonars (7 units)    SensComp       SensComp, Inc., 2009
GPS                 US Global Sat  USGlobalSat, Inc., 2009

The objectives for selecting a computing platform
appropriate for the robot are high computing power, compactness,
and low energy consumption. To have close to maximum
achievable speed, we concentrate on the x86 architecture
rather than far less powerful CPU types such as ARM
or XScale. Within the x86 family, we select the mobile
processor version rather than its desktop counterpart for
energy efficiency while remaining competitive in computing power.
By the same token, in using the mobile CPU version, the
corresponding embedded systems option can be selected
for the mobile platform (regular-sized motherboards do not
usually use mobile CPUs), which resolves the size issue.

Figure 2. Beobot 2.0 electrical system. On the right-hand side of the diagram, there are two baseboards, each housing four COM
Express modules and each module with its own SATA hard drive. The backbone intercomputer communication is gigabit Ethernet
that is connected through a switch. For visual interface to individual computers, a KVM is used to connect to either an 8-in. LCD
touchscreen or an external monitor, mouse, and keyboard. In addition, a PCI Express-Firewire interface card is used to connect to
a low-latency camera. The other sensors are connected via the many USB connectors that are panel mounted on top of the robot
as well as on the baseboard. The whole system is powered by a 24-V battery circuit supply (with kill switches for safety purposes)
and is regulated through a set of dedicated Pico-ATX power modules. The same battery circuit also powers the motors as well as
the liquid-cooling system.

In the family of embedded systems, there are two types
of implementations. The first type has the interfaces
already implemented, ready to use. An example is
the ITX form-factor family (pico-ITX, nano-ITX, mini-ITX)
(Via, 2009). The drawback is that the provided interfaces
are fixed. They may not be the specific ones that are needed,
and unused connections can be a waste of space as we cannot
customize their location and orientation. In addition, when
off-the-shelf motherboards are used, their dimensions have to
be accommodated in the design specifications, which may
also limit the options for the locomotion platform.

In contrast, the second type of embedded systems, the
COM concept, provides specifications for all the interfaces
only through a set of high-density connectors. These
specifications are usually defined by an industry consortium,
as with the XTX standard (XTX Consortium, 2009). The
actual breakout of the individual signals (such as gigabit
Ethernet, USB, PCI Express) from the module connectors to
the outside devices (a hard drive, for example) has to be
done on a custom-made carrier board. By building custom
baseboards, the overall size of the electronics can be
controlled by implementing only those signals that we actually
need. In addition, connector placement (as well as type) can
be specified so as to minimize the amount of cabling in the
system.

In the end, we found that a COM module solution best
met our requirements, which we stated at the start of this
section. Within this class, there are three options: ETX (ETX
Industrial Group, 2009), XTX (XTX Consortium, 2009), or
COM Express (COM Express Extension, 2009). These modules
use the most powerful processors, as opposed to the
smaller but less powerful systems such as PC104-based
Qseven (Qseven Standard, 2009). We chose COM Express
because it has an onboard gigabit Ethernet interface on the
module, and it is only slightly larger (12.5-cm length ×
9.5-cm width) than the XTX and ETX modules (11.5-cm
length × 9.5-cm width). Gigabit Ethernet is critical because
in a cluster architecture, intercomputer communication can
be just as important as the computing power of individual
nodes. If the communication procedure cannot provide
data fast enough, the individual processors will simply idle
most of the time, waiting for data to arrive. This is
especially true in our case because Beobot 2.0 is designed to
perform heavy-duty, real-time vision computation in which
the real-time video streaming is much more demanding
than sending intermediate results.

Figure 3. Baseboard. The image on the left is a fully assembled baseboard with four COM Express modules. The black plates are
the heat spreaders attached to the processors. There are also an Ethernet jack and two USB jacks placed on the right-hand side of the
board. The layout on the right is the circuit done in Altium (Altium Limited, 2009) PCB design software.
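
A back-of-the-envelope calculation helps show why link speed matters for raw video. The sketch below computes the bandwidth of an uncompressed stream; the resolution, pixel depth, and frame rate in the usage line are hypothetical values chosen for illustration and are not taken from the robot's camera configuration.

```python
def stream_bandwidth_mbps(width, height, bytes_per_pixel, fps):
    """Bandwidth of an uncompressed video stream in megabits per second."""
    bytes_per_second = width * height * bytes_per_pixel * fps
    return bytes_per_second * 8 / 1e6


# Hypothetical stream: 640 x 480 pixels, 3 bytes per pixel, 30 frames/s.
raw = stream_bandwidth_mbps(640, 480, 3, 30)

# Compare against nominal link rates (protocol overhead is ignored here).
fits_fast_ethernet = raw < 100.0
fits_gigabit = raw < 1000.0
```

Under these assumed numbers, a single raw stream already exceeds a 100-Mb/s link but fits comfortably within a gigabit link, which is consistent with the reasoning above.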

We implemented two carrier/baseboards (refer to
Figure 3), each accommodating four COM Express modules.
A total of eight modules was chosen because the computing
system fits within the mobile platform and because
this number is expected to suffice for our research needs
based on the findings presented in Section 5.

We used the Kontron COM Express design guide
(Kontron, 2007) [for the Kontron ETX-Express-MC 2.2 GHz
(T7500) COM Express module (Kontron, 2009)] to help
properly design the electronic circuits as well as lay out
the components on the board. We used the electronics
computer-aided design (ECAD) layout software Altium
(Altium Limited, 2009) to plan the physical placement of
all the desired devices and connectors with as little cabling
as possible for a system of eight computers. Altium's
three-dimensional (3D) visualization proved to be an invaluable
feature as it allowed us to verify that boards and devices
packed close together in the robot would not collide or
otherwise interfere with any other components.

The most critical part in successfully implementing
the baseboards was meeting the high-speed
differential-pair signal requirements such as matching
length and spacing, as well as minimizing the number
of vias in the baseboard. Altium allows its users to specify
rules for each trace on the board, which tremendously
eases the process of identifying unsatisfied constraints. We
found that the signals are quite robust as long as the stated
requirements are adhered to. In addition, most of these
signals need very few supporting circuits. The largest number
of components required by any signal is eight, for the USB
current limiter (500 mA) circuit. The VGA signal actually
specifies that it needs a filtering circuit with many
components, but the KVM chip furnishes this feature.

To provide clean and fail-safe power given a supply
from the available batteries, a Pico-ATX (Ituner Networks
Corp., 2009) module [see Figure 4(a)] is used to regulate
power to each COM Express module, for a total of eight.
There is also one extra Pico-ATX powering all the peripheral
boards and sensors. There are three peripheral boards: one to
control the cooling system, one to control access to the motors,
and a sensor board that houses all the various built-in
sensors. The power to the drive-train motors does not need to
be filtered and, thus, is directly connected from the
batteries. Because power supplies have a high rate of failure,
going with multiple power modules provides better
granularity in that if one module fails, the remaining seven
computers can still run. Additionally, the lower individual supply
requirement allows for a wider range of products to choose
from than would have been available in a single-module
solution.

Figure 4. Various power-related components: (a) Pico-ATX, (b) KVM board, (c) Liberty 312 battery and charger.

Table II summarizes the important electrical features:
interprocessor communications, input/output interfaces,
KVM interface, sensors, and power management. These
features proved to be critical when we compiled the list of
available commercial robots as well as in our experience
conducting robotics research.

3.  MECHANICAL  SYSTEM  DESIGN 
AND  IMPLEMENTATION 

The  mechanical  design  of  the  robot  is  divided  into  two 
parts:  the  locomotion  platform,  which  is  the  dark-colored 
robot  base  in  Figure  5  and  described  in  Section  3.1,  and  the 
computing  cluster  housing,  which  is  the  cardinal-colored 
structure  described  in  Section  3.2. 

Again, note that we created a wiki page (Siagian
et al., 2009) to detail the execution matters such as actual
part drawings (SolidWorks Corp., 2009), part manufacturing
through a machine shop, or finding suppliers for the
needed devices listed in the bill of materials.

3.1. Locomotion System

For the locomotion platform, we selected a Liberty 312
electric wheelchair (Major's Mobisist, 2009) instead of
building one from scratch. Often priced at thousands of
dollars, these types of units are easily acquired second hand
through channels such as Craigslist or eBay (ours cost U.S.
$200). The wheelchair is a robustly engineered, stable, safe,
and low-maintenance platform. Most importantly, adhering
to the wheelchair form factor allows the robot to traverse
most terrain types encountered in modern urban
environments, both indoors and out. This platform can also
carry heavy payloads (113-136 kg), which means that many
more devices can be added to the robot's computing cluster.
An important factor to consider is the ability to control the
motors over a wide range of speeds (0-16.09 km/h) with
good resolution in between. The wheelchair platform has
this characteristic as it is designed for fine-grained control,
as opposed to the remote control (RC) car used by the
original Beobot (Chung et al., 2002), which could be driven only
at maximum speed. Another benefit of the wheelchair is
that it places the computing cluster on top, relatively high
above the ground (about 50 cm) and away from the thick
dust and mud that can accumulate on the street. Note that
the robot's driving dynamics are taken care of because the
wheelchair is designed to have a person on top, where the
computing system is now placed. This is accomplished by
the wide-spacing configuration of the wheels, enveloping
the payload, to allow for the overall balance of the system
while it is moving reasonably fast. In addition, the heavy
SLA batteries are placed on the bottom to lower the center
of mass.

To control the wheelchair, we designed a motor board
to connect the battery and motors to inputs from the
computer for autonomous control as well as to a 2.4-GHz
remote controller (RC) for manual driving or overriding. A
dual-output motor driver named Sabertooth (Dimension
Engineering LLC, 2009) is used to provide up to 25 A to
each motor. In addition, because the motor driver has a
built-in electrical brake system, the mechanical brakes that
stop the motors by pinching the back shafts are taken off.
This then allows the back shafts to be coupled to a pair of
encoders to provide odometry data. As a safety precaution,
Beobot 2.0 is furnished with four kill switches (Figure 1),
one on each corner for the user to immediately stop the
robot in the event of an emergency.
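
For readers unfamiliar with encoder-based odometry, the following is a minimal sketch of a standard differential-drive pose update from a pair of encoder readings. It is not the robot's actual code: the function name and every parameter (`ticks_per_rev`, `wheel_radius_m`, `wheel_base_m`) are hypothetical, and `ticks_per_rev` is assumed to mean ticks per wheel revolution, so any gearing between the motor shaft and the wheel would have to be folded into it.

```python
import math


def update_pose(x, y, theta, left_ticks, right_ticks,
                ticks_per_rev, wheel_radius_m, wheel_base_m):
    """Advance a differential-drive pose (x, y, heading) by one reading.

    left_ticks and right_ticks are the signed tick counts accumulated
    since the previous update.
    """
    meters_per_tick = 2.0 * math.pi * wheel_radius_m / ticks_per_rev
    d_left = left_ticks * meters_per_tick
    d_right = right_ticks * meters_per_tick
    d_center = (d_left + d_right) / 2.0          # distance moved by the midpoint
    d_theta = (d_right - d_left) / wheel_base_m  # change in heading (rad)
    # Use the heading halfway through the step to approximate the arc.
    mid_heading = theta + d_theta / 2.0
    x += d_center * math.cos(mid_heading)
    y += d_center * math.sin(mid_heading)
    theta += d_theta
    # Wrap the heading into [-pi, pi].
    theta = math.atan2(math.sin(theta), math.cos(theta))
    return x, y, theta


# Equal tick counts on both wheels move the robot straight ahead.
pose = update_pose(0.0, 0.0, 0.0, 1000, 1000,
                   ticks_per_rev=1000, wheel_radius_m=0.1, wheel_base_m=0.5)
```

Odometry of this kind drifts over time, so in practice it would be combined with the other sensors listed in Table I.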

The wheelchair comes with a pair of 12-V, 35-Ah SLA
batteries, connected in series to provide a 24-V supply. Each
battery occupies a space of 19.5-cm length × 13.2-cm width
× 15.5-cm height. An attractive feature of
the wheelchair is the built-in, wall-outlet, easy-plug-in
battery recharging system, shown in Figure 4(c). With this, the
batteries can be conveniently recharged without having to
take them out of the robot and put them back in, although the
recharging process does take an average of 10 h.

3.2. Computing Cluster Case

The structure surrounding the computing cluster, as
shown in Figure 5, shields it from unwanted
environmental interference such as dust and mud.
The structure is divided into two isolated chambers as
illustrated in the figure. The back chamber is the watertight area
where the cluster is placed. The front chamber is an open
area, reserved for a liquid-cooling system (further
elaborated in Section 3.2.2), which includes a radiator to allow
for maximum air flow. The cooling components in the two
chambers are connected through Tygon tubing for liquid flow
and are physically held together by a pair of aluminum holders.
The computing cluster, together with the cooling system,
is mounted on shock-absorbing standoffs (Section 3.2.1) to
withstand violent collisions in the rare event that the robot
hits an obstacle.

3.2.1. Vibration Attenuation and Shock
Absorption System

As illustrated in Figure 5, the only connections
between the computing system and the robot base are the
shock-and-vibration damping standoffs. This makes it
easier to properly evaluate the necessary damping
requirements. When considering a damping solution, one needs to
take into account the basic relationship between shock and
vibration. That is, the solution has to be rigid enough to not
cause too much vibration on the load but flexible enough
to absorb shocks. Here, the focus is more on shock because,
like regular laptops, the computers should be able to work
despite the vibration that comes from reasonably rough
terrains. In addition, the system uses solid-state hard drives
(SSD), which have no moving parts and can withstand far
more shock than their mechanical counterparts.

[Figure 5 labels: Shock Absorber; Liberty 312 Wheelchair Base; Dust-proof Firewall Seal; Cooling Block; Robot Body Case; Radiator; Fans]

Figure 5. A SolidWorks (SolidWorks Corp., 2009) model of the robot shown from its side and cut through its center. The bottom
of the image displays the Liberty 312 wheelchair base, and above it is the robot body in cardinal color. The robot body is divided
into two chambers by the dust-proof firewall. The back chamber completely seals the computers within from the elements. The
front of the robot, which houses part of the cooling system, is open to allow for heat dissipation. The heat from the computers
is transferred by the liquid in the cooling block, which is attached to the heat spreaders on each module. The liquid then moves
through the radiator, which is cooled by the fans, before going back to the cooling block. In addition, the computing block is shock
mounted on four cylindrical mounts that are used to absorb shocks and vibration.

The  natural  rubber  cylindrical  mounts  are  selected 
over  other  options  such  as  wire-rope  isolators,  rubber  or 
silicone pads, and suspension springs because of their
compactness. In addition, the height of the standoffs is easily
adjustable by screwing together additional absorbers
according  to  needs.  Furthermore,  one  can  change  their 
shock  absorption  property  by  adding  washers  between  two 
mounts  if  need  be. 
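
The stiffness trade-off described in this subsection can be illustrated with the textbook single-degree-of-freedom isolator model. The sketch below uses the standard formulas for natural frequency and vibration transmissibility; it is not a model of the actual mounts, and the mass, stiffness, and damping ratio in the usage lines are hypothetical.

```python
import math


def natural_frequency_hz(stiffness_n_per_m, mass_kg):
    """Undamped natural frequency of a mass resting on a spring-like mount."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)


def transmissibility(freq_hz, natural_hz, damping_ratio):
    """Ratio of load vibration amplitude to base vibration amplitude."""
    r = freq_hz / natural_hz
    damping_term = (2.0 * damping_ratio * r) ** 2
    return math.sqrt((1.0 + damping_term) /
                     ((1.0 - r * r) ** 2 + damping_term))


# Hypothetical values: a 20-kg load on mounts with 80 kN/m total stiffness.
f_n = natural_frequency_hz(80_000.0, 20.0)
t_low = transmissibility(5.0, f_n, 0.1)    # below resonance: amplified
t_high = transmissibility(50.0, f_n, 0.1)  # far above resonance: attenuated
```

In this model, softer mounts lower the natural frequency and so isolate a wider band of vibration, at the cost of larger deflection under shock, which is the balance the text refers to.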

3.2.2.  Cooling  System 

Because Beobot 2.0 is meant to be used both indoors and
outdoors, we decided against an air-cooling system due to
the possibility of the fans pushing dust into the exposed
electronics inside. Air filters could have kept the
dust out, but the electronics would then have to be placed
in an area where air flow is well controlled, i.e., air must
be drawn in and exhausted out only through the fans. This
would have entailed a push-and-pull fan system and
significant prototyping and rework of the mechanical system.

Therefore, we settled on a liquid-cooling solution.
Moreover, as water has 30 times the thermal conductivity
and four times the heat capacity of air
(Callister, 2003), a liquid-cooling system is more effective in
addition to being cleaner.
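
To give a sense of scale, the steady-state heat balance Q = m_dot * c * dT relates the heat load to the coolant flow needed to carry it away. The sketch below assumes plain water as the coolant (specific heat about 4186 J/(kg K), density about 1 kg/L); the heat load and allowed temperature rise in the usage line are illustrative assumptions, not measured values for the robot.

```python
def required_flow_l_per_min(heat_w, delta_t_k,
                            specific_heat_j_per_kg_k=4186.0,
                            density_kg_per_l=1.0):
    """Coolant flow needed to remove heat_w watts with a delta_t_k rise.

    Solves Q = m_dot * c * dT for the mass flow, then converts to
    liters per minute. Defaults approximate plain water.
    """
    if delta_t_k <= 0:
        raise ValueError("temperature rise must be positive")
    mass_flow_kg_per_s = heat_w / (specific_heat_j_per_kg_k * delta_t_k)
    return mass_flow_kg_per_s / density_kg_per_l * 60.0


# Illustrative load: eight modules at a nominal 24 W each, 5-K coolant rise.
flow = required_flow_l_per_min(8 * 24.0, 5.0)
```

Under these assumptions, roughly half a liter per minute suffices, which suggests why a small pump can serve a cluster of this size.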

The liquid-cooling system, as shown in Figure 5,
consists of the following components: cooling block, tubes,
nozzles, radiator, two fans, liquid pump, reservoir,
cooling liquid, a flowmeter, and a temperature sensor to
monitor the system. Note that the system uses a cooling control
board to provide power for the fans and the pump, as well
as to take data from the flowmeter and temperature sensor.

The heat dissipated by the COM Express modules is
first transferred to the liquid coolant through the
processors' heat spreaders that are firmly pressed up against
the top and bottom of the cooling block, which contains
the coolant. We recommend using a high-performing,
low-conducting, noncorrosive coolant for a maintenance-free
system. The heat-carrying coolant then goes through the
radiator, which has two fans pulling air through the radiator
surface. These fans are the devices that actively take the
heat out of the system. Note that the radiator (and the fans)
can be placed as far away from the processors as necessary.
The liquid pump is connected to the system to ensure the
flow of the liquid. Finally, a reservoir is included to add
the coolant into the system and to take the air (bubbles)
out of it.

4.  SOFTWARE  DESIGN 

Our ultimate goal is to implement a fully autonomous
embodied system with complete visual scene understanding.
To do so, we lay the groundwork for a robot development
environment (Kramer & Scheutz, 2002) that makes the most
of the multiple-processor hardware architecture.
In addition, it fulfills the primary objective in designing the
software, viz., to be able to integrate and run computationally
heavy algorithms as efficiently as possible. The advantage
of using COM Express modules as a platform is that
they can be treated as regular desktops. This allows the
use of a Linux operating system in conjunction with C++
rather than some special-purpose environment. Note that,
this way, users can install any Linux-compatible
software tools that they prefer, not just the ones that
we suggest below. Also, although this is not a true real-time
system, it is quite adequate for our needs, with the
control programs running reasonably fast and the robot
responding in real time. In case a user would like to go with
a real-time operating system, several Linux-based options
and extensions are available (Politecnico di Milano, 2010;
QNX Software Systems, 2010; Wind River, 2010; Xenomai,
2010).

To speed up the development of the complex algorithms mentioned
above, we use the freely available iLab Neuromorphic Vision C++
Toolkit (Itti, 2009). The motivation for the toolkit is to facilitate
the recent emergence of a new discipline, neuromorphic engineering,
which challenges classical approaches to engineering and computer
vision research. These new research efforts are based on algorithms
and techniques inspired by and closely replicating the principles of
information processing in biological nervous systems. Their
applicability to engineering challenges is widespread and includes
smart sensors, implanted electronic devices, autonomous visually
guided robotics systems, prosthesis systems, and robust human-computer
interfaces. Thus, the development of a neuromorphic vision toolkit
helps provide a set of basic tools that can assist newcomers in the
field with the development of new models and systems.

Because of its truly interdisciplinary nature, the toolkit is
developed by researchers in psychology, experimental and
computational neuroscience, artificial intelligence, electrical
engineering, control theory, and signal and image processing. In
addition, it aids in integration with other powerful, freely available
software libraries such as Boost and OpenCV.


The project aims to develop next-generation general vision algorithms
rather than being tied to specific environmental conditions or tasks.
To this end, it provides a software foundation that can be used for
the development of many neuromorphic models and systems in the form of
a C++ library that includes classes for image acquisition,
preprocessing, visual scene understanding, and embodied system
control.

These systems can be deployed on a single machine or on a distributed
computing platform. We use the lightweight middleware ICE (Internet
Communication Engine) via its C++ library bindings to facilitate
intercomputer communication with a high-level interface that abstracts
out low-level matters such as marshaling data and opening sockets.
Sensors and devices, which are connected to a computer in the
distributed system, are encapsulated as independent services that
publish their data. Different systems can grab just the sensor outputs
that they need by subscribing to that particular service. In addition,
such a distributed system is fault tolerant, as nonfunctional services
do not bring down the whole system. We are also working on adding
functionality to quickly detect nonresponding hardware and recover
from failures, for example, by performing an ICE reconnection
protocol.
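
The publish/subscribe arrangement described above can be illustrated
with a minimal in-process sketch. It does not use ICE or reproduce its
API; the Broker class, the topic names, and the callbacks are all
hypothetical and serve only to show how subscribers receive just the
topics they ask for and how one failing consumer can be isolated from
the rest.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class Broker:
    """Toy in-process stand-in for a publish/subscribe middleware (not ICE)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = (
            defaultdict(list)
        )

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        """Register a consumer for one topic only."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver a message and return how many consumers accepted it."""
        delivered = 0
        for callback in self._subscribers[topic]:
            try:
                callback(message)
                delivered += 1
            except Exception:
                # A nonfunctional consumer is skipped rather than
                # bringing down the publisher or the other consumers.
                continue
        return delivered
```

A laser service would call publish("laser", scan) on every scan, and a
navigation module would call subscribe("laser", handler) once at
startup. A real deployment would additionally need network transport
and reconnection logic, which this sketch leaves out.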

Another aspect that deserves close attention is the need for robust
tools for debugging the distributed systems provided by the toolkit as
well as future applications. That is, we would like to know which
modules in the system take the longest time, which ones send the
largest amounts of data, and how these factors affect overall system
efficiency. Currently, the system has logging facilities for analysis
after a testing run has taken place. An online monitoring system would
be ideal.

In terms of hardware support, the toolkit has extensive source code
available for interfacing sensors through different avenues. For
example, Beobot 2.0 currently can connect to different types of
cameras: USB, Firewire, or IP. Other devices that use a serial
protocol should also be easily accommodated. In addition, it is
important to note that the separation of hardware-related and
algorithm-related code comes naturally. This allows the user to test
most of the software on both the robot and our custom simulator
(provided in the toolkit) without too many changes. Furthermore, the
same robot cluster computing design is used for our underwater and
aerial robotic vehicles. We find that porting the algorithms to the
other robots is done quite easily.

Table III lists all the vision- and robotics-related software
capabilities provided by the toolkit.

Table III. The vision toolkit features.

Features | Description | Available options
Devices | Interface code for various devices | Embedded systems/microcontrollers, joystick, keyboard, gyroscope, wii-mote, GPS, IMU (HMR3300, MicroStrain 3DM GX2), LRF (Hokuyo)
Robots | Control code for various robots | Scorbot robot arm, Evolution Robotics Rovio, iRobot Roomba, Beobot, Beobot 2.0, BeoSub submarine, BeoHawk quadrotor aerial robots, Gumbot for undergraduate introduction to robotics
Robotics algorithm | Modular mobile robotics algorithm | Localization, laser and vision navigation (lane following, obstacle avoidance), SLAM
Distributed programming tools | Allows programs to communicate between computers | CORBA, Beowulf, ICE
Neuromorphic vision algorithms | Biologically plausible vision algorithms | Center-surround feature maps, attention/saliency (multithreaded, fixed point/integer), gist, perceptual grouping, contour integration, border ownership, focus of expansion, motion
Media | Access to various input media | mpeg, jpeg, cameras (USB, IP, IEEE1394 Firewire), audiovisual
Image processing | Various tools to manipulate images | Drawing, cut/paste, color operations [hue saturation value (HSV), RGB, etc.], statistical operations, shape transformation, convolutions, Fourier transform, pyramid builder, linear algebra/matrix operation
Machine learning | Tools for pattern recognition training | K-nearest-neighbor, backpropagation neural networks, support vector machine, genetic algorithm
Object recognition | Visual object recognition modules | SIFT, HMAX

5. TESTING AND RESULTS

We examine a few aspects of Beobot 2.0. The first is basic
functionality such as power consumption, the cooling system, and
mobility as it pertains to shock absorption. The power consumption
testing shows the typical length of operation given the capacity of
the batteries, the weight that the motors have to move, and the eight
computers that the batteries have to power. Beobot 2.0 has a power
supply of 35 Ah × 24 V capacity from two 12-V SLA batteries in series.
The robot is run with full-load computing by running a vision
localization system (Siagian & Itti, 2009), explained in Section 5.2,
while the robot is driven around. In the testing, the cooling system
is shown to draw about 1.8 A from the 24-V supply, whereas the gigabit
switch and other sensors consume about 0.5 A. Each of the eight
computers pulls up to 0.7 A during heavy use, and the motors pull 2 A
when the robot is moving at about 1.61 km/h. The total comes to 9.9 A
in regular use, which corresponds to about 3.5 h of expected peak
computation running time.
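
The current budget above can be checked with a few lines of
arithmetic. The sketch below uses only the figures quoted in the text;
the function names are illustrative, and the runtime estimate assumes
an idealized battery that delivers its full rated capacity at a
constant current draw, which real SLA batteries do not, so measured
running times (see Table IV) can be shorter.

```python
# Current draw in amperes on the 24-V supply, taken from the text above.
COOLING_A = 1.8
SWITCH_AND_SENSORS_A = 0.5
PER_COMPUTER_A = 0.7
NUM_COMPUTERS = 8
MOTORS_A = 2.0
BATTERY_CAPACITY_AH = 35.0


def total_current_a() -> float:
    """Sum the draw of all subsystems under full computing load."""
    return (COOLING_A + SWITCH_AND_SENSORS_A
            + PER_COMPUTER_A * NUM_COMPUTERS + MOTORS_A)


def ideal_runtime_h(capacity_ah: float = BATTERY_CAPACITY_AH) -> float:
    """Idealized runtime: rated capacity divided by a constant current."""
    return capacity_ah / total_current_a()
```

With these inputs the total is 9.9 A and the idealized runtime is a
little over 3.5 h, matching the estimate in the text.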

The good news is that Beobot 2.0 has two accessible power jacks
located on its back, in the KVM board, as shown in Figure 4(b). By
plugging in an auxiliary power source that stops its current flow when
it detects another supply in the system, we can perform hot swapping
to temporarily replace the SLA batteries. This prolongs the running
time considerably, given that on-site system debugging occurs quite
often. Consequently, the running time becomes actual testing time,
without debugging time. This, for the most part, allows users to do
research on site for the whole day and charge all night.

Table  IV  summarizes  the  results. 


We then go into the usability of the system by reporting our
experience implementing the nearness diagram (ND) navigation system
(Minguez & Montano, 2004) in Section 5.1. Note that this section is
included to show that the robot can move about an environment and is
ready for use. We do not try to optimize the implementation to improve
the performance. On the other hand, in Section 5.2, we describe our
experiment performing three computationally intensive algorithms: the
SIFT (Lowe, 2004) object recognition system, distributed visual
saliency (Itti et al., 1998), and the robot vision localization system
(Siagian & Itti, 2009). These computational speed/throughput
experiments test the most critical aspect of the project's objectives.
Given the complexity of having to implement a cluster of processors,
we would like to see a good payoff for all our hard work.


Table IV. Beobot 2.0 subsystem testing.

Subsystem | Tests | Results | Remarks
Liquid cooling | CPUs at full load at room temperature of 22°C | Average CPU temperature of 41°C | System is virtually maintenance free, although it consumes 1.8 A for the liquid pump; CPUs reach critical overtemperature within 15 min if liquid-cooling system is turned off
Mobility and shock absorption | Running the robot throughout the campus using RF controller at 2 m/s | Computers run smoothly without disconnection through several bumps and abrupt stops whenever the robot is too close to nearby pedestrians | We modify the motor controller code to properly ramp down when going to a complete stop
Battery consumption | Running the robot using the remote controller with all computers running full computations | Robot runs for 2.25 h before one CPU shuts down | Can prolong the testing time considerably by hot swapping batteries (there is a jack at the back of robot) during on-site debugging; consequently, the running time becomes actual testing time, without debugging time, and allows us to do research on site for the whole day

5.1. Navigation Algorithm Implementation

In this section, we test the first algorithm to successfully run on
Beobot 2.0, viz., the ND navigation algorithm (Minguez & Montano,
2004), which uses a laser range finder to build a proximity map around
the robot and then searches this map for the navigable region closest
to a goal location. A navigable region is a space or area that is at
least as wide as the robot, thus enabling it to pass through. For
example, the system's graphical user interface (GUI) display in
Figure 6 shows the robot's surroundings, divided into nine distinct
regions.

The  robot  follows  a  series  of  binary  decision  rules  that 
classify  all  situations  into  five  mutually  exclusive  cases, 
which  are  summarized  in  Table  V.  Each  case  is  associated 
with  a  corresponding  movement  action. 

First, we define a security zone around the robot, a circular area
with twice the robot's radius. In the GUI display (Figure 6), this
zone is denoted by the light (yellow) center circle. If there are
obstacles within the security zone (red dots within the circle in the
figure), there are two cases to consider: whether there are obstacles
on both sides of the robot or only on one side. In the former case,
the robot tries to bisect the opening between them; in the latter
case, it can move more freely to the open side. Note that the system
considers only obstacles that are within 60 deg (between the two red
lines in Figure 6) of the robot's direction of motion (blue line in
the figure).

When there are no obstacles in the security zone, the system
considers three possible situations. If the goal location is in the
navigable region, the robot goes directly to it. If the goal location
is not in the navigable region but the region is wide (only one
obstacle on one of the sides), the robot maneuvers through the wide
region, in the hope that there is a way to go to the goal region in
the following time step. If the goal location is not in the navigable
region and the region is narrow (between two obstacles, one on each
side), the robot carefully moves forward in the middle of the region.
The overall resulting behavior is that the robot should continuously
center itself between obstacles while going to the goal.
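
The five-way decision described above can be summarized as a short
chain of binary tests. The sketch below is only an illustration of
that rule ordering, not the toolkit's implementation: the function
name, the enum labels, and the four boolean inputs are assumptions,
and the geometry needed to compute those inputs from a laser scan
(security zone, 60-deg window, region width) is omitted.

```python
from enum import Enum


class Situation(Enum):
    LOW_SAFETY_1 = 1        # obstacles on one side of the security zone
    LOW_SAFETY_2 = 2        # obstacles on both sides of the security zone
    HIGH_SAFETY_GOAL = 3    # goal lies inside the navigable region
    HIGH_SAFETY_WIDE = 4    # goal outside, navigable region is wide
    HIGH_SAFETY_NARROW = 5  # goal outside, navigable region is narrow


def classify(obstacle_left: bool, obstacle_right: bool,
             goal_in_region: bool, region_is_wide: bool) -> Situation:
    """Map the four binary tests onto five mutually exclusive cases.

    obstacle_left/right refer to obstacles inside the security zone.
    """
    if obstacle_left and obstacle_right:
        return Situation.LOW_SAFETY_2
    if obstacle_left or obstacle_right:
        return Situation.LOW_SAFETY_1
    if goal_in_region:
        return Situation.HIGH_SAFETY_GOAL
    if region_is_wide:
        return Situation.HIGH_SAFETY_WIDE
    return Situation.HIGH_SAFETY_NARROW
```

Because the security-zone tests come first, the goal-related inputs
are ignored whenever an obstacle is close, which mirrors the priority
that the text gives to safety over progress toward the goal.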


To test this algorithm, only two of the available eight computers are
needed. The laser range finder is plugged into one computer and the
motor control board into the other. Additionally, an RC setup allows
the user to change from autonomous to manual mode at the flick of a
switch in case the robot is about to hit something or has been stuck
in a corner for some time.

During implementation and debugging, a few notable features sped up
the process. First, the 8-in. LCD screen allowed users to observe the
system states and action decisions as the robot was moving. Second,
the use of a wireless USB keyboard and touch pad made it fairly easy
to issue new commands while the robot was working. Last, but not
least, taking the time to set up an intuitive GUI paid dividends very
quickly, as it made it much easier to understand what was going on and
how to fix the problems encountered.

The system is tested indoors, on a 20 × 24 ft empty area. We then
occupy some of the regions with obstacles and test Beobot 2.0 to see
whether it can navigate from one side of the environment and back.
Figure 7 shows a snapshot of the environment setup for the experiment.
In addition, some of the obstacle configurations are shown in
Figure 8, with an example odometry trace overlaid on top.

There are nine different obstacle configurations and robot starting
positions in the testing protocol. Each test was performed 10 times,
with the robot's speed being the only variable parameter. We vary the
speed between approximately 0.3 and 2.5 m/s. Table VI summarizes the
results of each trial. For the most part, the navigation system
performs well, with a 72% success rate. Here success is defined as the
robot moving from its starting side of the environment to the other
and back without touching any of the entities surrounding it.

Journal of Field Robotics DOI 10.1002/rob

Siagian et al.: Beobot 2.0: Cluster Architecture for Mobile Robotics

Table V. ND rules.

Number | Situation | Description | Action
1 | Low safety 1 | Only one side of obstacles in the security zone | Turn to the other side while maintaining the angle to the goal location
2 | Low safety 2 | Both sides of obstacles in the security zone | Try to center between both sides of obstacles and maintain the angle to the goal location
3 | High safety goal in region | All obstacles are far from the security zone and goal | Directly drive toward the goal
4 | High safety wide in region | All obstacles are far from the security zone but goal is not in this region | Turn half max angle away from closest obstacles
5 | High safety narrow in region | All obstacles are far from the security zone and narrower region in the goal location | Center both sides of closest obstacles


Figure 6. GUI of the ND navigation system. The system identifies nine
different regions (indexed from 0 to 8; note that 4 and 7 are cut off
because they are drawn outside the frame of the GUI). The robot's next
direction of motion is denoted by the dark line next to label 5. The
two lines next to the dark line delineate the boundaries of the
navigable region. The red lines indicate the directions 60 deg to the
left and right of the robot's next direction. We also display the
robot's translational and rotational motor commands. Both of these
numbers range from -1.0 to 1.0 (negative values indicate moving
backward and counterclockwise rotation, respectively).


Figure  7.  Snapshot  of  the  constructed  environment  for  ND 
navigation  testing. 


(a) Env1  (b) Env2  (c) Env3

Figure  8.  Various  environments  for  ND  navigation  testing 
with  an  example  path  that  is  taken  by  Beobot  2.0  using  the  ND 
navigation  algorithm. 

Table VI. Beobot 2.0 ND navigation testing.

End result | Occurrence | Percentage
Success | 65 | 72.22
Scraping the obstacles | 16 | 17.78
Stuck in corner or circles | 7 | 7.78
Squarely hit an obstacle | 2 | 2.22

Of the total of 90 trials, 25 resulted in failures of some sort.
Although this might seem excessive, it should be pointed out that the
majority of these failures (16) were of the type in which Beobot 2.0
only scraped an obstacle. This occurs whenever the robot has to turn
sharply to avoid an obstacle, which causes its rear to scrape the
obstacle. This is a minor problem that can be easily rectified by a
simple control fix; e.g., when a turn is judged to be sharp, first
back up a little.

There were two occasions when Beobot 2.0 actually hit an obstacle
head-on. This happened when the robot was running at its maximum
speed. Under this circumstance, the latency of the system (the laser
range finder delivers a scan every 25 ms, i.e., at 40 Hz, and
processing takes approximately the same time) is simply too large to
allow a timely reaction.
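
To give a sense of scale for this latency, the short sketch below
computes how far the robot travels before a new scan can influence its
motion. It assumes, as a simplification, that the total latency is
just the 25-ms scan period plus an equal processing time, and it uses
the 2.5 m/s upper end of the tested speed range; actuator response
and braking distance are ignored, so the real reaction distance is
larger. The function name is illustrative.

```python
def distance_during_latency(speed_m_s: float,
                            sensing_s: float = 0.025,
                            processing_s: float = 0.025) -> float:
    """Distance covered while one scan is acquired and processed."""
    return speed_m_s * (sensing_s + processing_s)


# At the top tested speed the robot moves about 12.5 cm "blind".
blind_distance_m = distance_during_latency(2.5)
```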

Finally, there were seven occasions when the robot became stuck in a
corner or kept spinning in place because it kept alternating between
left and right. The solution to this problem requires going beyond the
simple reactive nature of the navigation system and figuring out what
is globally optimal by integrating knowledge from localization or
simultaneous localization and mapping (SLAM) algorithms.

5.2. Computational Capabilities

In this section, we characterize the computing platform by running
three computationally intensive vision algorithms: the SIFT (Lowe,
2004) object recognition system, the distributed visual saliency
algorithm (Itti et al., 1998), and the biologically inspired robot
vision localization algorithm (Siagian & Itti, 2009). These algorithms
have a common characteristic in that their most time-consuming
portions can be parallelized, whether by distributing the feature
extraction process (Section 5.2.2) or by comparing those features to a
large database (Sections 5.2.1 and 5.2.3). These parallel computations
are then assigned to worker processes allocated at different computers
in Beobot 2.0's cluster. Thus, we can fully test the computation and
communication capabilities of the system.

5.2.1. SIFT Object Recognition System Test

As a first step in demonstrating the utility of our system in
performing computationally intensive vision tasks, we implemented a
simple keypoint matching system that is a very common component and
performance bottleneck in many vision-based robotic systems (Se, Lowe,
& Little, 2005; Valgren & Lilienthal, 2008). This task consists of
detecting various interest points in an input image, computing a
feature descriptor to represent each such point, and then searching a
database of previously computed descriptors to determine whether any
of the interest points in the current image have been previously
observed. Once matches between newly observed and stored keypoints are
found, the robot has a rough estimate of its surroundings and further
processing such as recognizing its location or manipulating an object
in front of it can commence.

Generally, the main speed bottleneck in this type of system is the
matching of newly observed keypoints against the potentially very
large database of previously observed keypoints. In the naive case
this operation is O(MN), where M is the number of newly observed
keypoints and N is the number of keypoints in the database. However,
this time can be cut to O(M log N) if the database of keypoints is
stored as a KD-tree.
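
The difference between the naive scan and a KD-tree lookup can be
seen in a small, self-contained sketch. This is a textbook KD-tree
written for illustration, not the toolkit's matcher; all names are
hypothetical, and the example uses low-dimensional random points. For
128-dimensional descriptors such as SIFT, exact KD-tree search is
known to prune far less effectively, so practical systems often rely
on approximate variants.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


class Node:
    def __init__(self, index: int, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.index, self.axis, self.left, self.right = index, axis, left, right


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points: List[Point], indices: List[int],
          depth: int = 0) -> Optional[Node]:
    """Recursively split the point indices at the median of one axis."""
    if not indices:
        return None
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return Node(indices[mid], axis,
                build(points, indices[:mid], depth + 1),
                build(points, indices[mid + 1:], depth + 1))


def nearest(node: Optional[Node], points: List[Point], query: Point,
            best: Tuple[float, int] = (float("inf"), -1)) -> Tuple[float, int]:
    """Return (squared distance, index) of the closest stored point."""
    if node is None:
        return best
    d = sq_dist(points[node.index], query)
    if d < best[0]:
        best = (d, node.index)
    diff = query[node.axis] - points[node.index][node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, points, query, best)
    if diff * diff < best[0]:  # the other half-space may hold a closer point
        best = nearest(far, points, query, best)
    return best


def brute_force(points: List[Point], query: Point) -> Tuple[float, int]:
    """Naive O(N) scan per query, giving O(MN) for M queries."""
    return min((sq_dist(p, query), i) for i, p in enumerate(points))
```

Building the tree once with build(points, list(range(len(points))))
and then calling nearest for each query returns the same distance as
brute_force while visiting only part of the tree.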

To test the efficacy of our cluster in speeding up such a task, we
built a matching system in which a master node (called SIFT master)
computes SIFT keypoints (Lowe, 2004) on an input image and then
distributes these keypoints to a number of worker nodes (SIFT worker)
for matching. Each of these workers is a separate process on the
cluster located on either the same or a different machine as the
master. Each worker contains a full copy of the database, stored as a
KD-tree. Upon receiving a set of keypoints (different sets for each
worker) from the master, a worker node compares each of them against
its database and returns a set of unique IDs to the master
representing the closest database match for each keypoint. Table VII
describes the different types of modules in the system, and Figure 9
illustrates the flow of operation.
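
The master/worker split described above can be sketched as follows.
This is a simplified stand-in, not the system's code: threads in one
process replace worker processes on separate machines, a brute-force
scan replaces each worker's KD-tree, and the function names and toy
descriptors are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

Descriptor = Sequence[int]


def match_chunk(chunk: List[Descriptor],
                database: List[Descriptor]) -> List[int]:
    """Worker: return the database index of the closest entry per keypoint."""
    def closest(desc: Descriptor) -> int:
        return min(range(len(database)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(desc, database[i])))
    return [closest(d) for d in chunk]


def split(keypoints: List[Descriptor],
          num_workers: int) -> List[List[Descriptor]]:
    """Master: cut the keypoint list into contiguous, near-equal chunks."""
    size, extra = divmod(len(keypoints), num_workers)
    chunks, start = [], 0
    for w in range(num_workers):
        end = start + size + (1 if w < extra else 0)
        chunks.append(keypoints[start:end])
        start = end
    return chunks


def distributed_match(keypoints: List[Descriptor],
                      database: List[Descriptor],
                      num_workers: int) -> List[int]:
    """Send one chunk to each worker and concatenate the IDs in order."""
    chunks = split(keypoints, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(match_chunk, chunks, [database] * num_workers)
    return [idx for part in results for idx in part]
```

Every worker holds the full database and receives a different subset
of the query keypoints, so the result is identical to matching all
keypoints on a single node; only the wall-clock time changes.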

The database used for the experiment is composed of 905,968 SIFT
keypoints obtained from HD footage (1,920 × 1,080 pixels) taken from
an outdoor environment traversed by our robot. Each keypoint has 128
dimensions and eight bits per dimension. We vary the number of workers
used to perform the keypoint matching from 1 to a maximum of 15
(because a total of 16 cores are available). Figure 10 illustrates the
allocations of the modules. Table VIII and Figure 11 record the time
required to process each frame, plotted against the number of workers.

Table VII. Beobot 2.0 SIFT object recognition time breakdown.

Operation | Description | Computation time
Input SIFT keypoint extraction | Extract SIFT keypoints from the input image; done by the master process | About 18 s
SIFT database matching | Match the input keypoints with the SIFT keypoints and return the results; done by the worker processes | 16.25-353.22 s, depending on the number of workers

Figure 9. Flow of the distributed SIFT database matching algorithm,
with steps denoted in increasing alphabetical order and referred to
below in parentheses. First the camera passes the high-definition
(1,920 × 1,080 pixel) frame to the SIFT master module (A). This module
takes about 18 s to extract the SIFT keypoints from the input image
before sending them to the SIFT worker processes utilized (denoted as
B_i, i being the total number of workers). Depending on the number of
workers, each takes between 16.25 and 353.22 s to return a match to
the SIFT master (C_i).

Table VIII. Beobot 2.0 SIFT database matching algorithm testing
results.

Number of workers | Processing time (s/frame) | Standard deviation (ms/frame)
1 | 353.2194 | 57.8369
2 | 193.6876 | 58.1703
3 | 130.8932 | 39.6815
4 | 95.5375 | 27.8618
5 | 95.3156 | 34.4357
6 | 57.5809 | 12.2352
7 | 47.1917 | 10.6182
8 | 40.2749 | 9.9586
9 | 37.2474 | 8.3234
10 | 29.8436 | 6.6259
11 | 37.7632 | 15.9283
12 | 18.9525 | 6.2032
13 | 20.9672 | 6.1088
14 | 25.3063 | 6.8077
15 | 16.2541 | 5.7455

Table VIII shows a total speedup of 21.73 times (from 353.22 to
16.25 s) in per-frame processing time between 1 and 15 workers. We
believe that the improvement exceeds 15-fold because of memory paging
issues that arise when dealing with the large messages necessary when
using a small number of nodes.
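
The speedup figure can be reproduced directly from the per-frame
times in Table VIII. The sketch below copies those times and computes
the speedup and the parallel efficiency (speedup divided by worker
count); the function names are illustrative. An efficiency above 1
indicates superlinear scaling, which the text attributes to memory
paging effects in the runs with few workers.

```python
# Seconds per frame for 1..15 workers, copied from Table VIII.
TIMES_S = [353.2194, 193.6876, 130.8932, 95.5375, 95.3156,
           57.5809, 47.1917, 40.2749, 37.2474, 29.8436,
           37.7632, 18.9525, 20.9672, 25.3063, 16.2541]


def speedup(num_workers: int) -> float:
    """Single-worker time divided by the time with num_workers workers."""
    return TIMES_S[0] / TIMES_S[num_workers - 1]


def efficiency(num_workers: int) -> float:
    """Speedup per worker; values above 1 are superlinear."""
    return speedup(num_workers) / num_workers
```

The same table also shows that the trend is not monotonic: the times
for 11 and 14 workers are higher than those for 10 and 13 workers,
respectively.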

Figure 11 also shows that although diminishing returns set in after
11 nodes, there is still a significant performance improvement from
the utilization of parallel workers in the cluster. Although not all
algorithms are as easily parallelized as this example, this experiment
shows that a very common visual localization front end can indeed be
parallelized and that the benefits of doing so are significant.


5.2.2.  Distributed  Visual  Saliency  Algorithm  Test 

One of the capabilities that is important in a robot is object
collection. Here, a key task to perform is object recognition, usually
from an image. There are times when the object may be small, or placed
in a cluttered environment. This is when an algorithm such as the
saliency model (Itti et al., 1998) can be quite useful. The term
saliency is defined as a measure of conspicuity in an image, and by
estimating this characteristic for every pixel in the image, parts of
it that readily attract the viewer's attention can be detected. Thus,
instead of blindly performing an exhaustive search throughout the
input image, the saliency model can direct the robot to the most
promising regions first. We can then equip the robot with a
high-resolution camera to capture all the details of its surroundings.
Furthermore, because of Beobot 2.0's powerful computing platform,
saliency processing in such a large image in a timely manner becomes
feasible.

To compute the salience of an image, the algorithm (Itti et al.,
1998) first computes various raw visual cortex features that depict
visual cues such as color, intensity, orientation (edges/corners), and
flicker (temporal change in intensity). Here we have multiple
subchannels for each domain: 2 color opponencies (red-green and
blue-yellow center-surround computation), 1 intensity opponency
(dark-bright), 12 orientation angles (increments of 15 deg), and 1 for
flicker. That is a total of 16 subchannels, each producing a
conspicuity map, which are then combined to create a single saliency
map.

Because the computations in each subchannel are independent, they can
be easily distributed. And so we use the algorithm to show how having
many cores in a robot can alleviate such a large computational demand.
For the experiment, we set up 1 master process and 1-15 worker
processes to calculate the saliency of images of 4,000 × 4,000 pixels
in size, for 100 frames. The master process takes approximately 100 ms
to preprocess the input image before sending the jobs to the workers.
The jobs themselves take up to 100 ms to finish for the color,
intensity, and flicker subchannels and up to 300 ms for the 12
orientation subchannels. Finally, the conspicuity map recombination
takes less than 10 ms. Table IX summarizes the running times of
individual parts of the system, and Figure 12 illustrates the flow of
the algorithm. In addition, Figure 13 shows the computers on which
each of the processes is run.
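
One way to reason about how 16 independent subchannel jobs spread
over a given number of workers is a simple scheduling model. The
sketch below is an assumption-laden illustration, not the allocation
actually used on Beobot 2.0: it takes the upper-bound job durations
quoted above, assigns jobs greedily (longest first, each to the least
loaded worker), and ignores preprocessing, recombination, and network
transfer, so its output is not expected to match the measured times.

```python
import heapq
from typing import List

# Upper-bound job durations in ms taken from the text: 12 orientation
# subchannels at up to 300 ms; 2 color, 1 intensity, and 1 flicker
# subchannel at up to 100 ms each.
JOB_MS = [300] * 12 + [100] * 4


def makespan_ms(num_workers: int, jobs: List[int] = JOB_MS) -> int:
    """Greedy longest-job-first assignment; returns the busiest worker's load."""
    loads = [0] * num_workers
    heapq.heapify(loads)
    for job in sorted(jobs, reverse=True):
        lightest = heapq.heappop(loads)       # least loaded worker so far
        heapq.heappush(loads, lightest + job)
    return max(loads)
```

Under this model the compute-only makespan stops improving once every
300-ms orientation job has its own worker, which suggests that, beyond
that point, communication cost rather than computation would dominate.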

Figure 10. Allocation of the different programs of the distributed
SIFT database matching algorithm in Beobot 2.0. The SIFT master module
is run on one of the cores in computer COM_E1, and the various SIFT
worker modules are allocated throughout the cluster.

Figure 11. Results for SIFT database matching testing on Beobot 2.0.

The results that we obtained from this experiment can be viewed in
Table X and are graphed in Figure 14. As we can see, the processing
time drops as we continue to add workers to the system.
Quantitatively, the processing time reduction comes reasonably close
to the expected value, at least early on. For example, given that the
processing time with one worker is 3,456.50 ms, using two workers
should ideally take half the time, 1,728.25 ms, which is comparable to
the actual time of 1,787.69 ms. This is usually the case for
straightforward distributed processing in which there are no
dependencies between the processes.

Another point of comparison is that we would like to gauge the improvement of the full cluster over what would be equivalent to a standard quad-core system. Thus, we compare the usage of 3 worker nodes against all 15 nodes. We see a speedup of 3.55 (from 1,249.68 to 352.26 ms), somewhat below the factor of 5 that ideal scaling would give. This shortfall is primarily attributed to network congestion, as we are shuffling large images around. Furthermore, if we compare the running time of 1 worker (3,456.5 ms) with 10 times the running time of 10 workers (478.33 ms x 10 = 4,783.3 ms), there is about 1,300 ms of added time. And so, as we add more and more workers, we expect to eventually hit a point of diminishing returns. A lesson to be taken here is that we should consider not only how to divide the task and properly balance job allocation but also how large the data set (or the total communication cost) is that needs to be distributed for each assigned job.
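
As a rough check on the scaling figures quoted above, the following Python sketch computes speedup and parallel efficiency from the Table X timings. The names `TIMES_MS`, `speedup`, and `efficiency` are illustrative only and are not part of the Beobot 2.0 software; the timings are copied from Table X.

```python
# Mean processing time per frame (ms) by number of workers, copied from Table X.
TIMES_MS = {
    1: 3456.50, 2: 1787.69, 3: 1249.68, 4: 979.14, 5: 733.85,
    6: 629.24, 7: 571.79, 8: 526.77, 9: 498.49, 10: 478.33,
    11: 452.13, 12: 436.75, 13: 375.59, 14: 362.72, 15: 352.26,
}


def speedup(workers: int, baseline: int = 1) -> float:
    """Ratio of the baseline time to the time measured with `workers` workers."""
    return TIMES_MS[baseline] / TIMES_MS[workers]


def efficiency(workers: int) -> float:
    """Speedup over one worker divided by the worker count (1.0 is ideal scaling)."""
    return speedup(workers) / workers


if __name__ == "__main__":
    # Ideal two-worker time is half the one-worker time.
    print("ideal 2-worker time (ms):", TIMES_MS[1] / 2)
    # Full cluster (15 workers) against a quad-core-like setup (3 workers).
    print("speedup, 3 -> 15 workers:", round(speedup(15, baseline=3), 2))
    for n in (2, 5, 10, 15):
        print(n, "workers: efficiency", round(efficiency(n), 2))
```

An efficiency well below 1.0 at higher worker counts is consistent with the communication overhead discussed above, although the sketch itself says nothing about the cause.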


Journal of Field Robotics DOI 10.1002/rob


Siagian et al.: Beobot 2.0: Cluster Architecture for Mobile Robotics




Figure 12. Flow of the distributed saliency algorithm, denoted in increasing alphabetical order and referred to below in parentheses. First the camera passes the high-resolution 4,000 x 4,000 pixel image to the SalMaster module (A). SalMaster preprocesses the image, which takes 100 ms, before sending out the image to various subchannel SaliencyWorker processes (denoted as B_i, i being the total number of workers). The color, intensity, and flicker subchannels take up to 100 ms, and the orientation subchannels take up to 300 ms. These results are then recombined by SalMaster (C_i), and this takes less than 10 ms.


5.2.3. Biologically Inspired Robot Vision Localization Algorithm Test

For the third computational test, we utilized the vision localization algorithm by Siagian and Itti (2009). It relies on matching localization cues from an input image with a large salient landmark database obtained from previous training runs to capture the scenes from the target environment under different lighting conditions.

The algorithm first computes the same raw visual cortex features that are utilized by the saliency algorithm (Itti et al., 1998). It then uses these raw features to extract gist information (Siagian & Itti, 2007), which approximates holistic aspects and the general layout of an image, to coarsely locate the robot in a general vicinity. In the next step, the system uses the same raw features to isolate the most salient regions in the image and compare them with the salient regions stored in the landmark database to refine its whereabouts to a metric accuracy. The actual matching between two regions is done using SIFT features (Lowe, 2004). Of the different parts of the algorithm, the salient region recognition process takes the longest time. However, the computations performed by this module are parallelizable by dispatching workers to compare particular parts of the database. Aside from the parallel searches, there are two other processes: one that extracts gist and saliency features from the input image, and a master process that assigns jobs and collects results from all landmark database search worker processes.

Table IX. Beobot 2.0 distributed saliency algorithm time breakdown.

| Module | Description | Computation time (ms) |
|---|---|---|
| Input image preprocessing | Computes luminance and red-green and blue-yellow color opponency maps to be sent to the worker processes; done by the master process | 100 |
| Conspicuity map generation | Performs center-surround operations in multiple scales to produce a conspicuity map for each subchannel | 300-3,900; depends on the number of workers utilized |
| Saliency map generation | Combines all the conspicuity maps returned by the workers to a single saliency map; done by the master process | 10 |

Figure 13. Allocation of the different programs of the distributed saliency algorithm in Beobot 2.0. The saliency master module is run on one of the cores in computer COM_E1, and the various saliency worker modules are allocated throughout the cluster.

Table X. Beobot 2.0 visual saliency algorithm testing results.

| Number of workers | Processing time (ms/frame) | Standard deviation (ms/frame) |
|---|---|---|
| 1 | 3,456.50 | 108.904 |
| 2 | 1,787.69 | 491.056 |
| 3 | 1,249.68 | 528.787 |
| 4 | 979.14 | 374.978 |
| 5 | 733.85 | 359.345 |
| 6 | 629.24 | 387.554 |
| 7 | 571.79 | 429.028 |
| 8 | 526.77 | 288.188 |
| 9 | 498.49 | 441.441 |
| 10 | 478.33 | 482.481 |
| 11 | 452.13 | 290.235 |
| 12 | 436.75 | 253.642 |
| 13 | 375.59 | 299.921 |
| 14 | 362.72 | 195.126 |
| 15 | 352.26 | 332.128 |

The gist and saliency extraction process, which operates on 160 x 120 pixel images, takes 30-40 ms to complete per frame and has to be run first. The images are placed in the computer that is connected to the camera. For this experiment, however, we are running off of previously collected data, without running the motors. Note that because the information being passed around consists of a few small salient regions (about five regions of 40 x 30 pixels, on average), only a small amount of time (4-5 ms) is spent on data transfer (using the ICE protocol) through the gigabit Ethernet network.

We then run the master search process, which takes about 50-150 ms (depending on the number of landmarks in the database) to create a priority queue for ordering landmark comparisons from most to least likely using saliency, gist, and temporal cues. For example, in the gist-based prioritization, if the gist features of the input image suggest that the robot location is more likely to be in a certain vicinity, we compare the incoming salient region with the stored regions found near that place. This prioritization improves the system speed because we are trying to find only a first match, which halts the search once it is found, and not the best match, which requires an exhaustive search through the entire database.
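
The first-match strategy described above can be sketched with a standard binary heap. This is a minimal illustration under stated assumptions, not the authors' implementation: the likelihood scores, the `is_match` predicate (standing in for the SIFT-based region comparison), and all function and variable names are hypothetical.

```python
import heapq
from typing import Callable, Iterable, Optional, Tuple


def first_match(
    candidates: Iterable[Tuple[float, str]],
    is_match: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Compare database entries from most to least likely; stop at the first match.

    candidates -- (likelihood, entry_id) pairs; a higher likelihood is tried first
    is_match   -- hypothetical stand-in for the expensive region comparison
    Returns the matched entry id (or None) and the number of comparisons made.
    """
    # heapq is a min-heap, so negate the score to pop the most likely entry first.
    heap = [(-score, entry_id) for score, entry_id in candidates]
    heapq.heapify(heap)

    comparisons = 0
    while heap:
        _, entry_id = heapq.heappop(heap)
        comparisons += 1
        if is_match(entry_id):
            return entry_id, comparisons  # halt: a first match is enough
    return None, comparisons


if __name__ == "__main__":
    queue = [(0.2, "region_a"), (0.9, "region_b"), (0.5, "region_c")]
    print(first_match(queue, lambda rid: rid in {"region_a", "region_c"}))
```

In this toy run the search stops after two comparisons, even though a third entry would also have matched; an exhaustive best-match search would have to visit all three.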

After these two processes are done, we can then dispatch the landmark search processes in parallel. For testing purposes, we use one, two, four, and eight computers to examine the increase in overall system speed. Noting that there are two cores in each computer, we dispatch 2, 4, 8, and 16 landmark database worker processes, respectively. The localization master then collects the match results to deduce the robot's most likely location given the visual evidence. This final step takes less than 5 ms.

Figure 14. Results for saliency algorithm testing on Beobot 2.0.

Table XI. Beobot 2.0 localization system time breakdown.

| Module | Description | Computation time |
|---|---|---|
| Gist and saliency | Computes various raw visual cortex features (color, intensity, and orientation) from the input image for gist and saliency feature extraction | 30-40 ms |
| Localization master | Creates a priority job queue (to be sent to the workers) for ordering landmark database comparisons from most to least likely using saliency, gist, and temporal cues; it also collects the search results from all search workers | 50-150 ms; depends on the size of the database |
| Localization worker | Compares the incoming salient region with the stored regions based on the prioritization order; the search halts once the first positive match is found | 300-3000 ms; depends on the size of the database and the number of workers utilized |

Table  XI  summarizes  the  various  processes.  Figure  15 
shows  the  program  allocation,  and  Figure  16  illustrates  the 
flow  of  the  algorithm. 

We test the system on the same data set as that used in Siagian & Itti (2009), which depicts a variety of visually challenging outdoor environments, from a building complex (ACB) to parks full of trees (AnFpark) to an open area (FDFpark). The database information for each site can be found in the corresponding row of Table XII, and example images can be viewed in Figure 17. The table shows the number of training sessions, each of which depicts a different lighting condition in the outdoor environments. This is one of the reasons why the database is so large. The table also introduces the term salient region (Siagian & Itti, 2009), denoted as SRegs, which is different from a landmark. A landmark is a real entity that can be used as a localization cue, whereas a salient region is evidence obtained at a specific time. Thus there are, on average, about 20 salient regions per landmark, covering different environmental conditions.
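
The "about 20 salient regions per landmark" figure can be checked against the database sizes reported in Table XII. The short Python sketch below is only an arithmetic check; the dictionary name `SITES` and the helper `regions_per_landmark` are illustrative, and the counts are copied from Table XII.

```python
# (number of landmarks, number of salient regions) per site, copied from Table XII.
SITES = {
    "ACB": (1501, 29710),
    "AnFpark": (4664, 82502),
    "FDFpark": (4808, 90660),
}


def regions_per_landmark(site: str) -> float:
    """Average number of stored salient regions per landmark for one site."""
    landmarks, salient_regions = SITES[site]
    return salient_regions / landmarks


if __name__ == "__main__":
    for site in SITES:
        print(site, round(regions_per_landmark(site), 2))
```

The ratios come out between roughly 17.7 and 19.8, in line with the per-site averages listed in the table.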

The results are shown in Table XIII. Here, we examine the processing time per frame, the localization error, and the number of salient regions found per frame. As we can see from the table, for each site there is always a decrease in processing time per frame as we increase the number of computers. At the same time, generally, there is an increase in accuracy in two of the three sites as the number of computers is increased. The reason for this is that the localization algorithm itself behaves differently as more and more resources are provided, in that it tries to optimize between the speed of computation and the accuracy of the results. Consequently, the running time analysis is not as straightforward. That is, we cannot just look at the nonlinearity of the relationship between the number of computers and processing time, observe that doubling the number of computers does not halve the processing time, and conclude that the algorithm does not use the available computing resources efficiently.

[Figure 15 diagram. Base Board 1: COM_E1 (camera input, Gist&Sal, Localization Master, two Localization Workers); COM_E2, COM_E3, and COM_E4 (two Localization Workers each). Base Board 2: COM_E1 through COM_E4 (two Localization Workers each).]

Figure 15. Allocation of the different programs of the localization system in Beobot 2.0. The gist and saliency extraction (GistSal) and localization master modules are allocated to computer COM_E1, and the various localization worker modules are assigned to cores throughout the cluster. Note that there are also two worker modules in COM_E1. This is because they run only when the GistSal and localization master modules do not, and vice versa.

Table XII. Beobot 2.0 vision localization system testing.

| Environment | Number of training sessions | Number of testing frames | Number of Lmk | Number of S. Regs | Number of S. Reg/Lmk |
|---|---|---|---|---|---|
| ACB | 9 | 3,583 | 1,501 | 29,710 | 19.79 |
| AnFpark | 10 | 6,006 | 4,664 | 82,502 | 17.69 |
| FDFpark | 11 | 8,823 | 4,808 | 90,660 | 18.86 |

Table XIII. Beobot 2.0 vision localization system testing results.

| Number of computers | ACB Time | ACB Err (m) | ACB S.Reg found/frame | AnF Time | AnF Err (m) | AnF S.Reg found/frame | FDF Time | FDF Err (m) | FDF S.Reg found/frame |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1,181.60 | 2.21 | 2.34/4.89 | 2,387.87 | 2.27 | 2.73/4.98 | 3,164.96 | 4.30 | 2.51/4.78 |
| 2 | 711.93 | 1.70 | 2.38/4.89 | 1,495.23 | 2.31 | 2.76/4.98 | 1,909.01 | 4.36 | 2.51/4.78 |
| 4 | 499.18 | 1.13 | 2.48/4.89 | 1,000.66 | 2.36 | 2.81/4.98 | 1,201.90 | 4.04 | 2.55/4.78 |
| 8 | 421.45 | 1.26 | 2.57/4.89 | 794.31 | 2.38 | 2.94/4.98 | 884.74 | 4.08 | 2.60/4.78 |

Figure 16. Flow of the localization system, denoted in increasing alphabetical order and referred to below in parentheses. First the camera passes a 160 x 120 pixel image to the gist and saliency extraction module (A), which takes 30-40 ms, before sending its results to the localization master module (B). This module then allocates search commands/jobs in the form of a priority queue to be sent to a number of localization workers (C_i, i being the total number of workers) to perform the landmark database matching. A search command job specifies which input salient region is to be compared to which database entry. This takes 50-150 ms, depending on the size of the database. The results are then sent back to the localization master (D_i) to make the determination of the robot location given the visual matches. The last step takes less than 10 ms.


As we explained earlier, the Siagian and Itti (2009) localization system orders the database landmarks from the most to the least likely to be matched. This is done by using other contextual cues (such as gist features, salient feature vectors, and temporal information) that can be computed much more quickly than the actual database matching process. The effect of this step is that it gives robot systems with limited computing resources the best possible chance to match the incoming salient regions. In addition, there is also an early-exit strategy that halts the search if any of the following conditions is met:

•  Three  regions  are  matched. 

•  Two  regions  are  matched  and  5%  of  the  queue  has  been 
processed  since  the  last  match. 

•  One  region  is  matched  and  10%  of  the  queue  has  been 
processed  since  the  last  match. 

•  No  regions  are  matched  and  30%  of  the  queue  has  been 
processed. 

This policy is designed to minimize the amount of unnecessary work when it is obvious that a subsequent match is very unlikely to be found. However, together with the increase in the number of workers, this policy actually creates a slightly different behavior.
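
The four early-exit conditions listed above can be written as a single predicate. The following Python sketch is one possible reading of the policy, not the authors' code: the function name, the argument names, and the choice of "at least" (rather than "exactly") for the percentage thresholds are assumptions.

```python
def should_halt(matches: int, since_last_pct: float, processed_pct: float) -> bool:
    """Return True when one of the four early-exit conditions holds.

    matches        -- number of salient regions matched so far
    since_last_pct -- percent of the queue processed since the last match
    processed_pct  -- percent of the queue processed overall
    """
    if matches >= 3:
        return True                  # three regions are matched
    if matches == 2:
        return since_last_pct >= 5   # two matches, 5% of the queue since the last
    if matches == 1:
        return since_last_pct >= 10  # one match, 10% of the queue since the last
    return processed_pct >= 30       # no matches, 30% of the queue processed


if __name__ == "__main__":
    print(should_halt(matches=2, since_last_pct=4.0, processed_pct=50.0))
    print(should_halt(matches=2, since_last_pct=5.0, processed_pct=50.0))
    print(should_halt(matches=0, since_last_pct=30.0, processed_pct=30.0))
```

Keeping the policy in one small function makes it clear that each additional match resets the "since the last match" counter and so postpones the exit.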

First, there is a difference between the number of jobs processed by a one-worker setup compared to a multiple-worker setup. In the former setup, the localization master process assigns a job, waits until the worker is done, and then checks whether any of the early-exit conditions are met before assigning another job. In the multiple-worker case, the master assigns many jobs at the same time and much more frequently. This increases the possibility of a match, as demonstrated by the increase in the number of salient regions found in Table XIII. In turn, this slows the running time by prolonging the search process by 5%, 10%, or even 15% (in a compound case) of the queue, depending on which early-exit conditions are invalidated.

Figure 17. Examples of images in the ACB (first row), AnFpark (second row), and FDFpark (third row).
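
The interaction between the number of workers and the early-exit policy can be illustrated with a toy simulation. The sketch below is a deliberate simplification and not the Beobot 2.0 scheduler: it assumes that jobs are dispatched in fixed-size batches, that the exit conditions are checked only after a whole batch returns, and that the positions of the matching jobs in the queue are known in advance. All names are hypothetical.

```python
def should_halt(matches, since_last, processed, total):
    """Early-exit conditions, using integer arithmetic on job counts."""
    if matches >= 3:
        return True
    if matches == 2:
        return since_last * 100 >= 5 * total
    if matches == 1:
        return since_last * 100 >= 10 * total
    return processed * 100 >= 30 * total


def run_search(outcomes, batch_size):
    """Process the queue in batches; return (jobs processed, matches found).

    outcomes   -- list of booleans, True where a job would yield a match
    batch_size -- jobs dispatched at once (1 models the one-worker setup)
    """
    total = len(outcomes)
    matches = processed = since_last = 0
    while processed < total:
        for hit in outcomes[processed:processed + batch_size]:
            processed += 1
            if hit:
                matches += 1
                since_last = 0
            else:
                since_last += 1
        # The master checks the exit conditions only after the batch returns.
        if should_halt(matches, since_last, processed, total):
            break
    return processed, matches


if __name__ == "__main__":
    # 100 queued jobs; the jobs at positions 2, 5, and 12 would match.
    outcomes = [i in (2, 5, 12) for i in range(100)]
    print(run_search(outcomes, batch_size=1))   # (11, 2)
    print(run_search(outcomes, batch_size=16))  # (16, 3)
```

In this made-up queue, the one-job-at-a-time search stops after 11 jobs with two matches, whereas the batched search processes 16 jobs and finds a third match, which mirrors the trade-off between extra matches and a longer search described above.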

On the other hand, the higher number of matches found can also increase the accuracy of the system, but not always. As we can see in Table XIII, there is a small but visible adverse effect of letting the search go on too long (most clearly in the AnF site). This is because the longer the search process, the more likely it is that a false positive is discovered, as the jobs lower in the priority queue are in lesser agreement with other contextual information. This is also reflected by the fact that most of the salient regions are found early in the search, as the numbers do not increase significantly as we add more computers. For example, in the ACB site, compare the number of salient regions found per frame using one computer (2.34) with that using eight (2.57).

From the table, we estimate that, for these environments, four computers appear to be the optimum number. Note that the localization system does not have to be real time, but being able to come up with a solution within seconds, as opposed to a minute, is essential because longer durations would require the robot to stop its motors and stay in place. This is what we are able to do with Beobot 2.0. In the full setup, the localization system is going to be run in conjunction with a salient region tracking mechanism, which keeps track of a region while it is being compared with the database, allowing the robot to move freely as long as the region is still in the field of view. If we use just four of the computers for localization, the others can be used for navigational tasks such as lane finding, obstacle avoidance, and intersection recognition, thus making the overall mobile robotic system real time. Currently, we have a preliminary result of a system that localizes and navigates autonomously in both indoor and outdoor environments, reported in Chang, Siagian, and Itti (2010).


6.  DISCUSSION  AND  CONCLUSIONS 

In this paper, we have described the design and implementation of an affordable research-level mobile robot platform equipped with a computing cluster containing eight dual-core processors for a total of 16 2.2-GHz CPUs. With such a powerful platform, we can create highly capable robotic applications that integrate many complex algorithms, use many different advanced libraries, and utilize large databases to recall information. In addition, by using the COM Express form-factor industry standard, the robot is able to stave off obsolescence for a longer period due to the ability to switch COM modules and upgrade to the latest processor and computer technology.

Furthermore, by implementing our own robot, we have demonstrated a cost-effective way to build such a computationally powerful robot. For more information on the cost breakdown, please refer to Siagian et al. (2009). The trade-off, of course, is in development time and effort. In our lab, we have had two people working full time on this project, with a few others helping out here and there. The total design and implementation time has been 18 months from conception to realization. We have had to think about many issues, no matter how small or seemingly trivial, in order to ensure that no major flaws are introduced into the design that can become showstoppers down the road. However, given that we now have the final design (Siagian et al., 2009), the implementation of a second robot ought to be relatively straightforward and much quicker, on the order of 2-3 months.

One might wonder why we would go to such an extraordinary effort to build such a complex hardware system when there may well be an easier alternative. For example, why not simply stack eight laptops on a mobile platform connected with Ethernet cables? We believe that with such an approach, it would be hard to isolate the computers from the elements. Cooling, for instance, would have to be done in two steps. First, the internal laptop fans would blow hot air out to a waterproof inner compartment of the robot body. Then, we would need another set of fans, fitted with filters, to drive the air in and out of the robot body. Furthermore, we would still have to create space to place all the other devices (for example, sensors and motor driver), along with power connections that also have to supply the main computers. The resulting sheer number of cables would easily make the system unwieldy and unappealing.

In our custom design, however, wires and cabling, which are often a source of connection failures, have been kept to a minimum thanks to the printed circuit board (PCB) design directives. Additionally, the liquid-cooling system is well sealed and runs smoothly every time we run the robot. In terms of maintenance, we use the robot every day and have found it to be quite trouble free. The wall plug-in battery charging system, for one, makes it convenient to charge the robot at night before going home. Finally, we would like to add that, although in this paper we are presenting a terrestrial system, the same technology has been applied to an underwater robot (USC Robotics, 2009), where dimensions and weights become critical factors and modifying COTS (commercial off the shelf) systems may not be feasible.

Nonetheless, with the benefit of hindsight, there are some things we would have liked to improve upon. One is easier access to various electronic components inside the robot body. For example, four of the COM Express modules are placed underneath the cooling block structure, and taking them out for repair can be somewhat difficult. This is the price we pay for designing such a highly integrated system. Another problem is managing many computers. In an application that requires all eight of Beobot 2.0's computers, we have to compile and run many programs in parallel with certain ordering constraints. In addition, we also have to decide on which computer each program should run, so that no computers are idling while others are overloaded. Although these issues cannot always be avoided, some forethought and automation via appropriate scripting can help. Frameworks such as MOSIX or the Scyld Beowulf system are available to aid this process, and we plan to test them on our robot in the future.

In the end, we believe that our primary contribution is that Beobot 2.0 allows for a class of computationally intensive algorithms that may need to access large knowledge databases, operating in large-scale outdoor environments, something that previously may not have been feasible on commercially available robots. In addition, it also enables researchers to create systems that run several of these complex modules simultaneously, which is exactly what we are currently working on in our lab. That is, we would like to run the localization system (Siagian & Itti, 2009), vision-based obstacle avoidance, and lane following (Ackerman & Itti, 2005) together. We also are planning to add components such as SLAM and human/robot interaction. The long-term goal is to make available plenty of predefined robotic components that can be reused to speed up future project developments.

Subsequently, the problem that we foresee is managing these diverse capabilities. We have to make sure that there are enough resources to work with and give priority to the most important and reliable subsystems in solving the task at hand as well as identifying dangers that threaten the livelihood of the robot. We hope that through our contribution of implementing an economical but powerful robot platform, we can start to see more of the type of complete systems that are needed to thrive in a real-world environment.